WO 03/028281 



PCT/CA01/01429 



A Cryptosystem for Data Security 

Field of the Invention 

The present invention relates to the field of data encoding and
encryption, and in particular, to data encryption in which statistical
information in the "plaintext" is removed in the transformation of the
plaintext into the "ciphertext".

Background of the Invention 

The need for sending messages secretly has led to the development of
the art and science of encryption, which has been used for millennia.
Excellent surveys of the field and the state of the art can be found in
Menezes et al. (Handbook of Applied Cryptography, CRC Press, (1996)),
Stallings (Cryptography & Network Security: Principles & Practice,
Prentice Hall, (1998)), and Stinson (Cryptography: Theory and Practice,
CRC Press, (1995)).

The hallmark of a perfect cryptosystem is the fundamental property
called Perfect Secrecy. Informally, this property means that for every
input data stream, the probability of yielding any given output data
stream is the same, and independent of the input. Consequently, there is
no statistical information in the output data stream, or ciphertext,
about the identity and distribution of the input data, or plaintext.

The problem of attaining Perfect Secrecy was originally formalized by
C. Shannon in 1949 (see Shannon (Communication Theory of Secrecy
Systems, Bell System Technical Journal, 28:656-715, (1949))), who
showed that if a cryptosystem possesses Perfect Secrecy, then the length
of the secret key must be at least as large as that of the plaintext.
This restriction makes Perfect Secrecy impractical in real-life
cryptosystems. An example of a system providing Perfect Secrecy is the
Vernam One-time Pad.
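
By way of illustration only, the key-length restriction can be seen in a
minimal sketch of a one-time pad. The function names and the use of
Python's `secrets` module are assumptions made for this sketch; they are
not taken from the cited works.

```python
import secrets


def otp_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """XOR each plaintext byte with the corresponding key byte."""
    if len(key) < len(plaintext):
        # The pad must cover the whole message, which is exactly the
        # key-length restriction discussed above.
        raise ValueError("key must be at least as long as the plaintext")
    return bytes(p ^ k for p, k in zip(plaintext, key))


# XOR is its own inverse, so decryption is the same operation.
otp_decrypt = otp_encrypt

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))  # fresh random key, used only once
ciphertext = otp_encrypt(message, key)
recovered = otp_decrypt(ciphertext, key)
```

The key in this sketch is as long as the message and must never be
reused, which is what makes such a scheme awkward in practice.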

Two related open problems in the fields of data encoding and
cryptography are:

1. Optimizing the Output Probabilities

There are numerous schemes which have been devised for data
compression/encoding. The problem of obtaining arbitrary encodings of
the output symbols has been studied by researchers for at least five
decades. Many encoding algorithms (such as those of Huffman, Fano,
Shannon, arithmetic coding and others) have been developed using
different statistical and structure models (e.g. dictionary structures,
higher-order statistical models and others). They are all intended to
compress data, but their major drawback is that they cannot control the
probabilities of the output symbols. A survey of the field is found in
Hankerson et al. (Introduction to Information Theory and Data
Compression, CRC Press, (1998)), Sayood (Introduction to Data
Compression, Morgan Kaufmann, 2nd edition, (2000)), and Witten et al.
(Managing Gigabytes: Compressing and Indexing Documents and Images,
Morgan Kaufmann, 2nd edition, (1999)).

Previous schemes have a drawback, namely that once the data
compression/encoding method has been specified, the user loses control
of the contents and the statistical properties of the compressed/encoded
file. In other words, the statistical properties of the output
compressed file are outside the control of the user; they take on their
values as a consequence of the statistical properties of the
uncompressed file and the data compression method in question.
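
This dependence can be illustrated with a short calculation. The sketch
below assumes a memoryless source over a hypothetical four-symbol
alphabet and one conventional prefix code for it; the symbols,
probabilities, code words and function name are all chosen here purely
for illustration.

```python
def zero_fraction(code, probs):
    """Long-run fraction of '0' bits in the output for a memoryless source.

    code:  mapping symbol -> binary code word (a string of '0'/'1')
    probs: mapping symbol -> occurrence probability
    """
    expected_zeros = sum(probs[s] * code[s].count("0") for s in code)
    expected_length = sum(probs[s] * len(code[s]) for s in code)
    return expected_zeros / expected_length


# Hypothetical source and a prefix code with a fixed labeling of branches.
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# The same code with every branch label swapped (0 <-> 1).
flipped = {s: w.translate(str.maketrans("01", "10")) for s, w in code.items()}

fraction_fixed = zero_fraction(code, probs)
fraction_flipped = zero_fraction(flipped, probs)
```

For this example the two labelings give roughly 0.47 and 0.53
respectively. In both cases the value follows from the source
probabilities and the chosen code; it is not something the user selects.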

A problem that has been open for many decades (see Hankerson et al.
(Introduction to Information Theory and Data Compression, CRC Press,
(1998), pp. 75-79)), which will be referred to herein as the
Distribution Optimizing Data Compression (or "DODC") problem (or, in
the more general context of not just compressing the plaintext, as the
Distribution Optimizing Data Encoding (or "DODE") problem), consists of
devising a compression scheme which, when applied to a data file,
compresses the file and simultaneously makes the file appear to be
random noise. The formal definition of this problem is found in
Appendix A. If the input alphabet is binary, the input probabilities of
'0' and '1' could be arbitrary, and are fully dictated by the
plaintext. The problem of the user specifying the output probabilities
of '0' and '1' in the compressed file has been considered an open
problem. Indeed, if the user could specify the stringent constraint
that the output probabilities of '0' and '1' be arbitrarily close to
0.5, the consequences would be very far-reaching, resulting in the
erasure of statistical information.

2. Achieving Statistical Perfect Secrecy 

The problem of erasing the statistical distribution from the input data
stream, and therefore from the output data stream, has fundamental
significance in cryptographic applications. It is well known that any
good cryptosystem should generate an output that has random
characteristics (see Menezes et al. (Handbook of Applied Cryptography,
CRC Press, (1996)), Stallings (Cryptography & Network Security:
Principles & Practice, Prentice Hall, (1998)), Stinson (Cryptography:
Theory and Practice, CRC Press, (1995)), and Shannon (Communication
Theory of Secrecy Systems, Bell System Technical Journal, 28:656-715,
(1949))).

A fundamental goal in cryptography is to attain Perfect Secrecy (see
Stinson (Cryptography: Theory and Practice, CRC Press, (1995))).

Developing a pragmatic encoding system that satisfies this property is
an open problem that has been unsolved for many decades. Shannon (see
Menezes et al. (Handbook of Applied Cryptography, CRC Press, (1996)),
Stallings (Cryptography & Network Security: Principles & Practice,
Prentice Hall, (1998)), Stinson (Cryptography: Theory and Practice, CRC
Press, (1995)), and Shannon (Communication Theory of Secrecy Systems,
Bell System Technical Journal, 28:656-715, (1949))) showed that if a
cryptosystem possesses Perfect Secrecy, then the length of the secret
key must be at least as large as that of the plaintext. This makes the
development of a realistic Perfect Secrecy cryptosystem impractical, as
demonstrated by the Vernam One-time Pad.

Consider a system in which X = x[1] ... x[M] is the plaintext data
stream, where each x[k] is drawn from a plaintext alphabet,
S = {s_1, ..., s_m}, and Y = y[1] ... y[R] is the ciphertext data
stream, where each y[k] ∈ A, and A is of cardinality r.

Informally speaking, a system (including cryptosystems, compression
systems, and in general, encoding systems) is said to possess
Statistical Perfect Secrecy if all its contiguous output sequences of
length k are equally likely, for all values of k, independent of X.
Thus, a scheme that removes all statistical properties of the input
stream also has the property of Statistical Perfect Secrecy. A system
possessing Statistical Perfect Secrecy maximizes the entropy of the
output computed on a symbol-wise basis.

More formally, a system is said to possess Statistical Perfect Secrecy
if for every input X there exists some integer j_0 > 0 and an
arbitrarily small positive real number δ_0 such that for all j > j_0,

$$\Pr\left[\, y_{j+1} \ldots y_{j+k} \mid X \,\right] = \frac{1}{r^{k}} \pm \delta_0$$

for all k, 0 < k < R - j_0.

A system possessing this property is correctly said to display
Statistical Perfect Secrecy. This is because, for all practical purposes
and for all finite-length subsequences, statistically speaking, the
system behaves as if it, indeed, possessed the stronger property of
Perfect Secrecy.
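
As a rough empirical proxy for this property, one may count the blocks
of length k in a single output sequence and compare their relative
frequencies with 1/r^k. The following is a minimal sketch under that
assumption; the function name and the `skip` parameter (standing in for
j_0) are hypothetical, and frequencies measured on one sequence are only
an estimate of the conditional probabilities in the definition.

```python
from collections import Counter


def max_kgram_deviation(output, r, k, skip=0):
    """Largest |relative frequency - 1/r**k| over all length-k blocks.

    output: sequence of ciphertext symbols (a string or a list)
    r:      cardinality of the ciphertext alphabet
    k:      block length to examine
    skip:   number of leading symbols to ignore (the role of j_0)
    """
    tail = output[skip:]
    n = len(tail) - k + 1  # number of contiguous blocks of length k
    if n <= 0:
        raise ValueError("sequence too short for blocks of length k")
    counts = Counter(tuple(tail[i:i + k]) for i in range(n))
    target = 1 / r**k
    deviation = max(abs(c / n - target) for c in counts.values())
    if len(counts) < r**k:
        # At least one possible block never occurred; its frequency is 0.
        deviation = max(deviation, target)
    return deviation


# A perfectly alternating stream looks uniform for k = 1 but not for k = 2.
dev_k1 = max_kgram_deviation("01010101", r=2, k=1)
dev_k2 = max_kgram_deviation("01010101", r=2, k=2)
```

The example shows why the definition quantifies over all k: balancing
single symbols alone does not make longer blocks equally likely.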

It is interesting to note that Statistical Perfect Secrecy is related to
the concept of Perfect Secrecy. However, since the property of
Statistical Perfect Secrecy can characterize any system and not just a
cryptosystem, there is no requirement of relating the size of the key to
the size of the input, as required by Shannon's theorem.

Summary of the Invention 

The object of the present invention is to provide an improved method of
encoding and decoding data, which permits the user to specify certain
statistical parameters of the ciphertext, and to control or remove
statistical information during the process of encoding plaintext into
ciphertext.

It is a further object of an embodiment of the present invention to
provide an improved method of encoding and decoding data, which permits
the user to obtain Statistical Perfect Secrecy in the ciphertext.

It is a further object of an embodiment of the present invention to
provide an improved method of encoding and decoding data, which ensures
that Statistical Perfect Secrecy in the ciphertext has been obtained.

It is a further object of an embodiment of the present invention to
provide an improved method of encryption and decryption.

It is a further object of an embodiment of the present invention to
provide an improved method of steganography.

It is a further object of an embodiment of the present invention to
provide an improved method of secure communications.

It is a further object of an embodiment of the present invention to
provide an improved method of secure data storage and retrieval.

Accordingly, in one aspect, the present invention relates to a method
for creating ciphertext from plaintext comprising the steps of: (a)
receiving a character of plaintext; (b) traversing an Oommen-Rueda Tree
between the root and that leaf corresponding to that character of
plaintext and recording the Assignment Value of each branch so
traversed; (c) receiving a next character of plaintext; and repeating
steps (b) and (c) until the plaintext has been processed.

In another aspect, the present invention relates to a method for
creating ciphertext from plaintext comprising the steps of: (a) creating
an Oommen-Rueda Tree; (b) receiving a character of plaintext; (c)
traversing the Oommen-Rueda Tree between the root and that leaf
corresponding to that character of plaintext and recording the
Assignment Value of each branch so traversed; (d) receiving a next
character of plaintext; and (e) repeating steps (c) and (d) until the
plaintext has been processed.

In another aspect, the present invention relates to a method for
creating ciphertext from plaintext comprising the steps of: (a)
receiving an Oommen-Rueda Tree; (b) receiving a character of plaintext;
(c) traversing the Oommen-Rueda Tree between the root and that leaf
corresponding to that character of plaintext, and recording the
Assignment Value of each branch so traversed; (d) receiving a next
character of plaintext; and (e) repeating steps (c) and (d) until the
plaintext has been processed.

In another aspect, the present invention relates to a method for
creating ciphertext from plaintext comprising the steps of: (a) creating
an Oommen-Rueda Tree, which Oommen-Rueda Tree has leaves associated with
the members of the alphabet of the plaintext, each member of the
alphabet of the plaintext being associated with at least one leaf, which
Oommen-Rueda Tree's internal nodes each have at least one branch
depending therefrom, which Oommen-Rueda Tree branches have associated
therewith an Assignment Value, which Assignment Value is associated with
a member of the alphabet of the ciphertext, which Oommen-Rueda Tree's
nodes each have associated therewith a quantity related to the frequency
weight of each of the nodes and leaves dependent therefrom; (b)
receiving a first character of plaintext; (c) traversing the
Oommen-Rueda Tree between the root and that leaf corresponding to that
character of plaintext and recording the Assignment Value of each branch
so traversed; (d) receiving the next symbol of plaintext; and (e)
repeating steps (c) and (d) until the plaintext has been processed.

In another aspect, the present invention relates to a method for
creating ciphertext from plaintext comprising the steps of: (a)
receiving an Oommen-Rueda Tree, which Oommen-Rueda Tree has leaves
associated with the members of the alphabet of the plaintext, each
member of the alphabet of the plaintext being associated with at least
one leaf, which Oommen-Rueda Tree's internal nodes each have at least
one branch depending therefrom, which Oommen-Rueda Tree branches have
associated therewith an Assignment Value, which Assignment Value is
associated with a member of the alphabet of the ciphertext, which
Oommen-Rueda Tree's nodes each have associated therewith a quantity
related to the frequency weight of each of the nodes and leaves
dependent therefrom; (b) receiving a first character of plaintext; (c)
traversing the Oommen-Rueda Tree between the root and that leaf
corresponding to that character of plaintext and recording the
Assignment Value of each branch so traversed; (d) receiving the next
symbol of plaintext; and (e) repeating steps (c) and (d) until the
plaintext has been processed.

In another aspect, the present invention relates to a method of
decoding ciphertext, comprising the steps of: (a) receiving a first
character of ciphertext; (b) utilizing an Oommen-Rueda Tree having a
structure corresponding to the Oommen-Rueda Tree initially utilized by
the encoder and utilizing the same Branch Assignment Rule as utilized by
the encoder to provide the Assignment Values for the branches depending
from the root, traversing such Oommen-Rueda Tree from the root towards a
leaf, the first character of ciphertext determining the branch to then
be traversed; (c) if a leaf has not been reached, utilizing the same
Branch Assignment Rule as utilized by the encoder to provide Assignment
Values for the branches depending from the node that has been reached,
receiving the next character of ciphertext, and continuing to traverse
the Oommen-Rueda Tree from the node that has been reached towards a
leaf, the current symbol of ciphertext determining the branch to then be
traversed; (d) when a leaf is reached, recording the plaintext character
associated with the label of the leaf, the root becoming the node that
has been reached for the purpose of further processing; and (e)
repeating steps (c) and (d) until all symbols of ciphertext have been
processed.

In another aspect, the present invention relates to a method of
decoding ciphertext, comprising the steps of: (a) creating an
Oommen-Rueda Tree; (b) receiving a first character of ciphertext; (c)
utilizing an Oommen-Rueda Tree having a structure corresponding to the
Oommen-Rueda Tree initially utilized by the encoder and utilizing the
same Branch Assignment Rule as utilized by the encoder to provide the
Assignment Values for the branches depending from the root, traversing
such Oommen-Rueda Tree from the root towards a leaf, the first character
of ciphertext determining the branch to then be traversed; (d) if a leaf
has not been reached, utilizing the same Branch Assignment Rule as
utilized by the encoder to provide Assignment Values for the branches
depending from the node that has been reached, receiving the next
character of ciphertext, and continuing to traverse the Oommen-Rueda
Tree from the node that has been reached towards a leaf, the current
symbol of ciphertext determining the branch to then be traversed; (e)
when a leaf is reached, recording the plaintext character associated
with the label of the leaf, the root becoming the node that has been
reached for the purpose of further processing; and (f) repeating steps
(d) and (e) until all symbols of ciphertext have been processed.

In another aspect, the present invention relates to a method of
decoding ciphertext, comprising the steps of: (a) receiving an
Oommen-Rueda Tree; (b) receiving a first character of ciphertext; (c)
utilizing an Oommen-Rueda Tree having a structure corresponding to the
Oommen-Rueda Tree initially utilized by the encoder and utilizing the
same Branch Assignment Rule as utilized by the encoder to provide the
Assignment Values for the branches depending from the root, traversing
such Oommen-Rueda Tree from the root towards a leaf, the first character
of ciphertext determining the branch to then be traversed; (d) if a leaf
has not been reached, utilizing the same Branch Assignment Rule as
utilized by the encoder to provide Assignment Values for the branches
depending from the node that has been reached, receiving the next
character of ciphertext, and continuing to traverse the Oommen-Rueda
Tree from the node that has been reached towards a leaf, the current
symbol of ciphertext determining the branch to then be traversed; (e)
when a leaf is reached, recording the plaintext character associated
with the label of the leaf, the root becoming the node that has been
reached for the purpose of further processing; and (f) repeating steps
(d) and (e) until all symbols of ciphertext have been processed.

The advantage of an embodiment of the present invention is that it
provides a method of encoding and decoding data, which encoded data has
the Statistical Perfect Secrecy property.

A further advantage of an embodiment of the present invention is that it
guarantees that the encoded message has the Statistical Perfect Secrecy
property.

A further advantage of an embodiment of the present invention is that it
can be adapted to simultaneously provide optimal and lossless
compression of the input data stream in a prefix manner.

A further advantage of an embodiment of the present invention is that it
can also be adapted to simultaneously provide an output of the same size
or larger than the input data stream in a prefix manner.

A further advantage of an embodiment of the present invention is that it
provides an improved method of encryption and decryption.

A further advantage of an embodiment of the present invention is that it
provides an improved method of steganography.

A further advantage of an embodiment of the present invention is that it
provides an improved method of secure communications.

A further advantage of an embodiment of the present invention is that it
provides an improved method of secure data storage and retrieval.

Brief Description of the Figures 

Figure 1 presents an Oommen-Rueda Tree in which the input alphabet is
S = {a, b, c, d, e, f} with probabilities P = {0.1, 0.15, 0.27, 0.2,
0.05, 0.23}, and the output alphabet is A = {0, 1}. The root points to
two children. Each node stores a weight, which is the sum of the weights
associated with its children. It also stores the ordering of the
children in terms of their weights, i.e., whether the weight of the left
child is greater than that of the right child, or vice versa. Although,
in this example, the encoding does not achieve optimal data compression,
using DODE, arbitrary output probabilities can be achieved.

Figure 2 presents a Huffman tree constructed using Huffman's algorithm
with S = {a, b, c, d}, P = {0.4, 0.3, 0.2, 0.1}, and A = {0, 1}.

Figures 3 and 4 present a schematic diagram showing the Process
D_S_H_E_{m,2} used to encode an input data sequence. The input to the
process is the Static Huffman Tree, T, and the source sequence, X. The
output is the sequence, Y. It is assumed that there is a hashing
function which locates the position of the input alphabet symbols as
leaves in T.

Figures 5 and 6 present a schematic diagram showing the Process
D_S_H_D_{m,2} used to decode a data sequence encoded by following
Process D_S_H_E_{m,2}. The input to the process is the Static Huffman
Tree, T, and the encoded sequence, Y. The output is the decoded
sequence, which is, indeed, the original source sequence, X.

Figure 7 depicts the two possible labeling strategies that can be used
in a Huffman tree constructed for Process D_S_H_E_{2,2}, where
S = {0, 1}, and P = [p, 1 - p] with p > 0.5.

Figure 8 graphically plots the average distance obtained after running
D_S_H_E_{m,2} on file bib from the Calgary Corpus, where f* = 0.5 and
Θ = 2.

Figures 9 and 10 present a schematic diagram showing the Process
D_A_H_E_{m,2} used to encode an input data sequence. The input to the
process is an initial Adaptive Huffman Tree, T, and the source sequence,
X. The output is the sequence, Y. It is assumed that there is a hashing
function which locates the position of the input alphabet symbols as
leaves in T.

Figures 11 and 12 present a schematic diagram showing the Process
D_A_H_D_{m,2} used to decode a data sequence encoded by following
Process D_A_H_E_{m,2}. The input to the process is the same initial
Adaptive Huffman Tree, T, used by Process D_A_H_E_{m,2}, and the encoded
sequence, Y. The output is the decoded sequence, which is, indeed, the
original source sequence, X.

Figure 13 graphically plots the average distance obtained after running
D_A_H_E_{m,2} on file bib from the Calgary Corpus, where f* = 0.5 and
Θ = 2.

Figures 14 and 15 present a schematic diagram showing the Process
RV_A_H_E_{m,2} used to encode an input data sequence. The input to the
process is an initial Huffman Tree, T, and the source sequence, X. The
output is the sequence, Y. It is assumed that there is a hashing
function which locates the position of the input alphabet symbols as
leaves in T.

Figure 16 demonstrates how the decision in the Branch Assignment Rule in
Process RV_A_H_E_{m,2} is based on the value of f̂(n) and a
pseudo-random number, α.

Figures 17 and 18 present a schematic diagram showing the Process
RV_A_H_D_{m,2} used to decode a data sequence encoded by following
Process RV_A_H_E_{m,2}. The input to the process is the same initial
Adaptive Huffman Tree, T, used by Process RV_A_H_E_{m,2}, and the
encoded sequence, Y. The output is the decoded sequence, which is,
indeed, the original source sequence, X.

Figure 19 demonstrates how the decision in the Branch Assignment Rule in
Process RR_A_H_E_{m,2} is based on the value of f̂(n) and two
pseudo-random number invocations.

Figure 20 graphically plots the average distance obtained after running
Process RV_A_H_E_{m,2} on file bib from the Calgary Corpus, where
f* = 0.5 and Θ = 2.

Figure 21 graphically plots the average distance obtained after running
Process RV_A_H_E_{m,2} on file bib from the Calgary Corpus, where
f* = 0.5 and Θ = 2, and where the transient behavior has been
eliminated.

Figure 22 graphically plots the average distance obtained after running
Process RR_A_H_E_{m,2} on file bib from the Calgary Corpus, where
f* = 0.5 and Θ = 2, and where the transient behavior has been
eliminated.

Figure 23 displays the original carrier image, the well-known Lena
image, and the resulting image after applying steganographic techniques
to the output of the file fields.c from the Canterbury corpus. The
steganographic method used is a fairly simplistic one, but includes the
message encrypted as per Process RV_A_H_E_{m,2}.

Description of the Preferred Embodiments

The invention operates on a data structure referred to as the
Oommen-Rueda Tree, defined below for a plaintext alphabet of
cardinality m and a ciphertext alphabet of cardinality r, where the
desired output frequency distribution is F*, and its measured
(estimated) value is F̂.

The Oommen-Rueda Tree 

An optimized Oommen-Rueda Tree is defined recursively as a set of
Nodes, Leaves and Edges as follows:

1. The tree is referenced by a pointer to the tree's root, which is a
unique node, and which has no parents.

2. Every node contains a weight which is associated with the sum of
the weights associated with its children (including both nodes and
leaves).

3. In a typical Oommen-Rueda Tree, every node has r edges (or branches),
each edge pointing to one of its r children, which child could be a
node or a leaf. However, if any node has fewer than r children, it can
be conceptually viewed as having exactly r children where the rest
of the children have an associated weight of zero.

4. Every node maintains an ordering on its children in such a way that
it knows how the weights of the children are sorted.

5. Each plaintext alphabet symbol is associated with a leaf (where a
leaf is a node that does not have any children). The weight
associated with each leaf node is associated with the probability (or
an estimate of the probability) of the occurrence of the associated
plaintext symbol.

6. A hashing function is maintained which maps the source alphabet
symbols to the leaves, thus ensuring that when an input symbol
is received, a path can be traced between the root and the leaf
corresponding to that input symbol. An alternate approach is to
search the entire tree, although this would render the Oommen-Rueda
Process less efficient.

7. If the input symbol probabilities are known a priori, the
Oommen-Rueda Tree is maintained statically. Otherwise, it is maintained
adaptively in terms of the current estimates of the input
probabilities, by re-ordering the nodes of the tree so as to maintain
the above properties.
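
The seven properties above can be sketched as a small data structure.
The following is a minimal illustration only: the class and function
names are hypothetical, a plain dictionary stands in for the hashing
function of property 6, and the tree shape chosen in the example is an
assumption made for this sketch rather than a tree taken from the
figures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    weight: float = 0.0
    symbol: Optional[str] = None  # set only on leaves (property 5)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def is_leaf(self) -> bool:
        return not self.children

    def child_order(self) -> List[int]:
        """Child indices sorted by decreasing weight (property 4)."""
        return sorted(range(len(self.children)),
                      key=lambda i: -self.children[i].weight)


def internal(*children: Node) -> Node:
    """Create a node whose weight is the sum of its children's (property 2)."""
    node = Node(weight=sum(c.weight for c in children),
                children=list(children))
    for child in children:
        child.parent = node
    return node


def leaf_index(root: Node) -> Dict[str, Node]:
    """Stand-in for the hashing function of property 6: symbol -> leaf."""
    index, stack = {}, [root]
    while stack:
        node = stack.pop()
        if node.is_leaf():
            index[node.symbol] = node
        else:
            stack.extend(node.children)
    return index


def path_to_root(leaf: Node) -> List[Node]:
    """Nodes visited when walking from a leaf up to the root (property 1)."""
    path = [leaf]
    while path[-1].parent is not None:
        path.append(path[-1].parent)
    return path


# Hypothetical binary tree (r = 2) over a six-symbol plaintext alphabet.
probs = {"a": 0.1, "b": 0.15, "c": 0.27, "d": 0.2, "e": 0.05, "f": 0.23}
leaves = {s: Node(weight=p, symbol=s) for s, p in probs.items()}
root = internal(
    internal(leaves["c"], internal(leaves["d"], leaves["a"])),
    internal(leaves["f"], internal(leaves["b"], leaves["e"])),
)
index = leaf_index(root)
```

Keeping a parent pointer on every node is one design choice that lets
the path for a symbol be recovered from its leaf without searching the
whole tree.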

Specific instantiations of the Oommen-Rueda Tree are the Huffman tree
and the Fano tree, in which the ordering of the nodes at every level
obeys a sibling-like property (see Hankerson et al. (Introduction to
Information Theory and Data Compression, CRC Press, (1998))).
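
For the Huffman instantiation, the depth of each leaf can be obtained
with the standard greedy merge of the two lightest subtrees. The sketch
below is a minimal illustration using Python's `heapq`; the function
name is hypothetical, and the four-symbol distribution is used only as
an example input.

```python
import heapq
import itertools


def huffman_code_lengths(probs):
    """Return a mapping symbol -> depth of its leaf in a binary Huffman tree."""
    tick = itertools.count()  # tie-breaker so dictionaries are never compared
    heap = [(p, next(tick), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, depths1 = heapq.heappop(heap)  # lightest subtree
        p2, _, depths2 = heapq.heappop(heap)  # second lightest subtree
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]


probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
lengths = huffman_code_lengths(probs)
average_length = sum(probs[s] * lengths[s] for s in probs)
```

The construction fixes only the shape of the tree and the weights of its
nodes; which output symbol labels each branch is left open, and that is
the freedom the Branch Assignment Rule exploits.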

It is to be understood that the Oommen-Rueda Trees referred to herein
include those less optimal versions of the above-defined Oommen-Rueda
Tree which can be designed by introducing nodes, leaves and edges which
have no significant contribution, or whose significance can be
minimized by merging them with other nodes, leaves and edges
respectively, resulting in an Oommen-Rueda Tree with a lesser number of
nodes, leaves and edges respectively.

It is further to be understood that if the sibling property is not
maintained in a left-to-right order, the corresponding assignments will
not be made in a left-to-right order.

It is to be understood that the term "plaintext" encompasses all manner
of data represented symbolically and includes, but is not limited to,
data utilized in the processing of audio data, speech, music, still
images, video images, electronic mail, internet communications and
others.

Finally, it is to be understood that the combination of the
Oommen-Rueda Tree with the encoding and decoding processes described
herein will provide a solution of the Distribution Optimizing Data
Encoding problem in which the output may be compressed, of the same
size, or even expanded.

The Encoding Process Utilizing Oommen-Rueda Trees

The encoding process related to Oommen-Rueda Trees involves traversing
(in accordance with a hashing function maintained for the Oommen-Rueda
Tree) the Oommen-Rueda Tree between the root and the leaf corresponding
to the current plaintext symbol, and recording the branch assignment of
the edges so traversed, which branch assignment is determined in
accordance with the Branch Assignment Rule. The actual encoding of the
current plaintext symbol is done by transforming the plaintext input
symbol into a string associated with the labels of the edges (determined
as per the Branch Assignment Rule) on the path from the root to the leaf
associated with that plaintext symbol.

It is understood that the traversal of the Oommen-Rueda Tree (as
dictated by the hashing function) in the Encoding Process can be
achieved either in a root-to-leaf manner, or in a leaf-to-root manner,
in which latter case the output stream is reversed for that particular
plaintext symbol.
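
The encoding process described above can be sketched as follows. This
is a minimal illustration under several assumptions: the tree is
written as nested Python lists with strings as leaves, a dictionary of
precomputed paths stands in for the hashing function, the traversal is
root-to-leaf, and the rule is passed in as a hypothetical function that
returns the label of each branch of the current node.

```python
def build_paths(tree, prefix=()):
    """Map each leaf symbol to the child indices leading to it from the root."""
    if isinstance(tree, str):  # a leaf is represented by its symbol
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(build_paths(child, prefix + (i,)))
    return paths


def identity_rule(num_children, emitted):
    """Simplest possible rule: branch i always carries label i."""
    return list(range(num_children))


def reversed_rule(num_children, emitted):
    """Same tree, but with the labels assigned in the opposite order."""
    return list(range(num_children))[::-1]


def encode(tree, plaintext, rule=identity_rule):
    """Walk root-to-leaf for every symbol, recording each branch label."""
    paths = build_paths(tree)
    out = []
    for symbol in plaintext:
        node = tree
        for branch in paths[symbol]:
            labels = rule(len(node), out)  # the rule may inspect past output
            out.append(labels[branch])
            node = node[branch]
    return out


# Hypothetical binary tree: a at depth 1, b at depth 2, c and d at depth 3.
tree = ["a", ["b", ["c", "d"]]]
fixed = encode(tree, "abcd")
swapped = encode(tree, "ab", rule=reversed_rule)
```

Passing the output emitted so far to the rule is what allows a rule to
react to the statistics of the ciphertext, while the tree itself is
left unchanged.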

The Branch Assignment Rule 

1. This Rule determines the assignment of labels on the edges of the
tree, the labels being associated with the symbols of the output
alphabet.

2. The assignment can be chosen either deterministically, or randomly
involving a fixed constant, or randomly by invoking random variables
whose distributions do not depend on the current measurement
(or estimate), F̂, or randomly by invoking at least one random
variable whose distribution depends on the current measurement (or
estimate), F̂.

3. The assignments can be made so as to converge to the predetermined
value of F*. Additionally, the Branch Assignment Rule can
be designed so that the Process converges to a probability
distribution which simultaneously maintains the independence of the
output symbols. This can be achieved by utilizing the Markovian history,
for example, using f̂_{0|0} and f̂_{0|1} to converge to f*_{0|0} and
f*_{0|1}, respectively, when the output alphabet is binary.

The Tree Restructuring Rule 

1. If the Oommen-Rueda Tree is updated adaptively, the updating of
the Oommen-Rueda Tree is done after the processing of each plaintext
symbol has been effected. For example, in the case of the Huffman
tree, the updating is achieved by using the technique described
by Knuth (Dynamic Huffman Coding, Journal of Algorithms, 6:163-180,
(1985)), or Vitter (Design and Analysis of Dynamic Huffman
Codes, Journal of the ACM, 34(4):825-845, (1987)), wherein the
tree is updated so as to maintain the sibling property at every level
of the tree.

2. For other instantiations of the Oommen-Rueda Trees, it is to be
understood that the Tree Restructuring Rule can be effected by
either re-creating the tree using an exhaustive search of all possible
Oommen-Rueda Trees characterizing that instantiation, or by
incrementally modifying them based on the properties of the particular
instantiation.
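
The timing of the update, after each symbol has been processed, can be
illustrated with a deliberately simple stand-in. The sketch below does
not implement the incremental techniques of Knuth or Vitter; it merely
rebuilds a Huffman code from the current counts before every symbol,
which is an assumption made to keep the example short. All function
names are hypothetical, every symbol is assumed to start with a count of
one, and the alphabet is assumed to contain at least two symbols.

```python
import heapq
import itertools


def build_codes(counts):
    """Rebuild a binary Huffman code (symbol -> bit string) from counts."""
    tick = itertools.count()  # tie-breaker; keeps the rebuild deterministic
    heap = [(w, next(tick), {s: ""}) for s, w in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + word for s, word in c1.items()}
        merged.update({s: "1" + word for s, word in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tick), merged))
    return heap[0][2]


def adaptive_encode(text, alphabet):
    counts = {s: 1 for s in alphabet}
    out = []
    for ch in text:
        out.append(build_codes(counts)[ch])
        counts[ch] += 1  # restructure only after the symbol is processed
    return "".join(out)


def adaptive_decode(bits, alphabet):
    counts = {s: 1 for s in alphabet}
    out, i = [], 0
    while i < len(bits):
        inverse = {word: s for s, word in build_codes(counts).items()}
        j = i + 1
        while bits[i:j] not in inverse:  # extend until a code word matches
            j += 1
            if j > len(bits):
                raise ValueError("truncated or corrupt input")
        symbol = inverse[bits[i:j]]
        out.append(symbol)
        counts[symbol] += 1  # same update, at the same point, as the encoder
        i = j
    return "".join(out)


encoded = adaptive_encode("abracadabra", "abcdr")
decoded = adaptive_decode(encoded, "abcdr")
```

Because both sides apply the same update at the same point, the decoder
reconstructs the same tree at every step without the tree ever being
transmitted.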

The Decoding Process Utilizing Oommen-Rueda Trees 

The decoding process assumes that the Decoder can either create or is
provided with an Oommen-Rueda Tree, which Tree is identical to the
Oommen-Rueda Tree utilized by the Encoder to encode any pre-specified
plaintext symbol. The actual decoding process involves the steps of:

1. Traversing the Decoder's copy of the Oommen-Rueda Tree from the
root to a leaf, the traversal being determined by the same Branch
Assignment Rule utilized by the Encoder, and the ciphertext symbols.

2. Recording the label of the leaf as the decoded symbol.

3. If the Oommen-Rueda Tree is updated adaptively, updating the
Decoder's Oommen-Rueda Tree as per the Tree Restructuring Rule
described above, after the source symbol associated with
the current encoded string has been determined. This ensures that
after the decoding of each plaintext symbol has been effected, both
the Encoder and Decoder maintain identical Oommen-Rueda Trees.
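
The decoding steps above can be sketched for a static binary tree as
follows. This is an illustration under stated assumptions: the tree is
written as nested lists with the heavier child listed first, and
`branch_labels` is a hypothetical deterministic rule that depends only
on the ciphertext produced so far, so that the Decoder can recompute it
at every step.

```python
F_STAR = 0.5  # assumed target for the fraction of zeros in the output


def branch_labels(cipher_so_far):
    """Labels for (first child, second child), chosen from past output only."""
    if cipher_so_far:
        f_hat = cipher_so_far.count(0) / len(cipher_so_far)
    else:
        f_hat = 0.0
    return [0, 1] if f_hat <= F_STAR else [1, 0]


def build_paths(tree, prefix=()):
    """Map each leaf symbol to the child indices leading to it from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(build_paths(child, prefix + (i,)))
    return paths


def encode(tree, plaintext):
    paths, out = build_paths(tree), []
    for symbol in plaintext:
        for branch in paths[symbol]:
            out.append(branch_labels(out)[branch])
    return out


def decode(tree, ciphertext):
    out, seen, node = [], [], tree
    for y in ciphertext:
        labels = branch_labels(seen)   # same rule, same history as encoder
        node = node[labels.index(y)]   # the symbol selects the branch
        seen.append(y)
        if isinstance(node, str):      # a leaf: record it, restart at root
            out.append(node)
            node = tree
    return "".join(out)


tree = ["a", ["b", ["c", "d"]]]
cipher = encode(tree, "abacabad")
plain = decode(tree, cipher)
```

Recounting the zeros at every step keeps the sketch short; a running
counter would serve the same purpose more efficiently.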

It is to be understood that if the Encoder utilized a process that
involved a Branch Assignment Rule, the same Branch Assignment Rule
must also be utilized by the Decoder. Furthermore, if the Encoder
utilized a process that did not involve a Branch Assignment Rule, the
Decoder's process must also not involve a Branch Assignment Rule.

It is further to be understood that if the Encoder utilized a process
that involved a Tree Restructuring Rule, the same Tree Restructuring
Rule must also be utilized by the Decoder. Furthermore, if the Encoder
utilized a process that did not involve a Tree Restructuring Rule, the
Decoder's process must also not involve a Tree Restructuring Rule.

Additional Instantiations of the Invention

The Oommen-Rueda Tree may be used in a wide range of circum- 
stances and is flexible to the needs of the user. 

If the occurrence probabilities of the plaintext input symbols are
known a priori, the tree (and consequently, the associated processes)
are Static. On the other hand, if the occurrence probabilities of the
input symbols are not known a priori, the tree (and consequently, the
associated processes) are Adaptive.

As indicated above, this invention also utilizes a Branch Assignment
Rule for the labels of the edges of the Oommen-Rueda Tree. A process
manipulating the tree is termed Deterministic or Randomized depending
on whether the Branch Assignment Rule is Deterministic or Randomized
respectively, which is more fully described below for the specific
instantiations.

The various embodiments of the invention are specified herein by
first stating whether the Branch Assignment Rule is (D)eterministic or
(R)andomized. In the case of a randomized Branch Assignment Rule,
the effect of randomization on the branch assignment can be either
determined by a comparison with a (F)ixed constant, a (V)ariable or a
(R)andom variable itself.

The second specification of a process details whether the tree is created 
in a (S)tatic or (A)daptive manner. 

The third specification of the embodiment details the specific
instantiation of the Oommen-Rueda Tree (that is, whether the generalized
Oommen-Rueda Tree is being used, or a specific instantiation such as
the Huffman or Fano tree is being used).

The fourth specification of a process informs whether it is an Encoding
or Decoding Process.

The last specification gives the cardinalities of the input and output
alphabets.

The following sections describe specific embodiments of the invention
for the case when the size of the ciphertext alphabet is 2. It is easily seen
that when the size of the ciphertext alphabet is $r$, the corresponding
$r$-ary embodiment can be obtained by concatenating $\lceil \log_2 r \rceil$
bits in the output and causing this binary string of length
$\lceil \log_2 r \rceil$ to represent a single symbol from the output
$r$-ary alphabet. When $r$ is not a power of two, $r$ symbols can be
composed with probability $\frac{1}{r}$, thus ignoring the other strings
of length $\lceil \log_2 r \rceil$, implying that this method assigns a
probability value of zero for the strings of length
$\lceil \log_2 r \rceil$ that have been ignored.
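
By way of illustration only, the grouping of output bits into r-ary
symbols may be sketched as follows. The sketch assumes that r is a
power of two (the other case is not handled here), and the function
name bits_to_rary is hypothetical and does not appear in this
specification.

```python
def bits_to_rary(bits, r):
    """Group a binary string into r-ary symbols (r assumed a power of two)."""
    k = r.bit_length() - 1              # k = log2(r) when r is a power of two
    if r < 2 or (1 << k) != r:
        raise ValueError("this sketch only handles r = 2, 4, 8, ...")
    if len(bits) % k != 0:
        raise ValueError("bit string length must be a multiple of log2(r)")
    # Each block of k bits is read as one symbol of the r-ary alphabet.
    return [int(bits[i:i + k], 2) for i in range(0, len(bits), k)]


print(bits_to_rary("011010", 4))   # [1, 2, 2]
```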

Using the above nomenclature, the following are some specific
embodiments of the Invention:

Throughout this document, unless otherwise specified, DDODE (for
Deterministic Distribution Optimizing Data Encoding) will be used as a
generic name for any Deterministic solution that yields Statistical Perfect
Secrecy. Specific instantiations of DDODE are D_S_H_E_{m,2},
D_A_H_E_{m,2}, etc. Similarly, RDODE (for Randomized Distribution
Optimizing Data Encoding) will be used as a generic name for any
Randomized solution that yields Statistical Perfect Secrecy. Specific
instantiations of RDODE are RF_S_H_E_{m,2}, RV_A_H_E_{m,2}, etc.

D_S_H_E_{m,2}: The Deterministic Embodiment using
Static Huffman Coding

The Encoding Process: D_S_H_E_{m,2}

This section describes the process that provides a method of
optimizing the distribution of the output by using the sibling property of
Huffman trees (see Hankerson et al. (Introduction to Information Theory
and Data Compression, CRC Press, (1998)), Sayood (Introduction to
Data Compression, Morgan Kaufmann, 2nd. edition, (2000)), and Witten
et al. (Managing Gigabytes: Compressing and Indexing Documents
and Images, Morgan Kaufmann, 2nd. edition, (1999))), and a
deterministic rule that changes the labeling of the encoding scheme
dynamically during the encoding process.

This document uses the notation that $\hat{f}(n)$ represents the estimate
of the probability of 0 in the output at time 'n', and $f(n)$ refers to the
actual probability of 0 in the output at time 'n'. The estimation of $f(n)$
uses a window that contains the last $t$ symbols generated at the output,
and $\hat{f}(n)$ is estimated as $\frac{c_0(t)}{t}$, where $c_0(t)$ is the
number of 0's in the window. In the included analysis and implementation,
the window has been made arbitrarily large by considering $t = n$, i.e. the
window is the entire output sequence up to time 'n'. The effect of the
window size, 't', will be discussed later.
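
The windowed estimate described above may be sketched as follows. This
is a minimal illustration only: the class name ZeroFrequencyEstimator
and the choice of returning 0 before any bit has been produced are
assumptions of the sketch, not part of the specification. Passing no
window corresponds to the case t = n.

```python
from collections import deque


class ZeroFrequencyEstimator:
    """Estimate the probability of 0 over the last t output bits."""

    def __init__(self, window=None):
        # window=None keeps every bit, i.e. t = n (the whole output so far).
        self.bits = deque(maxlen=window)

    def update(self, bit):
        # With a bounded window the oldest bit is dropped automatically.
        self.bits.append(bit)

    def estimate(self):
        if not self.bits:
            return 0.0                  # assumed initial value
        return self.bits.count(0) / len(self.bits)


est = ZeroFrequencyEstimator(window=3)
for b in [0, 1, 1, 0]:
    est.update(b)
print(est.estimate())   # one 0 among the last three bits
```

Storing the bits keeps the sketch short; a running counter would serve
equally well when t = n.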

The process D_S_H_E_{m,2} is formally given below and pictorially in
Figures 3 and 4.

Schematic of D_S_H_E_{m,2}

The schematic chart of Process D_S_H_E_{m,2} is explained here. The
figures begin with an Input/Output block (block 100) in which (i) the
source sequence, X = x[1] ... x[M], is the data to be encoded, (ii) r
is the root of the Huffman tree that has been created using the static
Huffman coding approach (see Huffman (A Method for the Construction
of Minimum Redundancy Codes, Proceedings of IRE, 40(9), pp. 1098-
1101, (1952))), (iii) h(s) is the hashing function that returns the pointer
to the node associated with any plaintext symbol s, and (iv) $f^*$ is the
requested probability of '0' in the output.

In block 110, the estimated probability of '0' in the output, $\hat{f}$, is
initialized to 0, and the counter of zeros in the output, $c_0$, is initialized
to 0 as well. The counter of bits in the output, n, is initialized to 1, and
the counter of symbols read from the input is initialized to 1 too. Other
straightforward initializations of these quantities are also possible.

A decision block is then invoked in block 120 which constitutes the
starting of a looping structure that is executed from 1 to the number
of symbols in the input, M. In block 130, a variable, q, which keeps
the pointer to the current node being inspected in the tree, is initialized
to the node associated with the current symbol, x[k], by invoking the
hashing function h(x[k]). The length of the code word associated with
x[k], $\ell$, is initialized to 0. The decision block 140 constitutes the
starting point of a looping that traces the path from the node associated
with x[k] to the root of the tree, r. Block 150 increments the length of
the code word associated with x[k] by 1, since this length is actually the
number of edges in that path. The decision block 160 tests if the
corresponding path goes through the left or right branch, true for left
(block 170) and false for right (block 180). In block 185, the current
pointer, q, moves up to its parent towards the root of the tree.
Connector 190 is the continuation of the branch "No" of the decision
block 140, i.e. when q reaches the root of the tree (q = r). Connector 200
is the continuation of the decision "No" of block 120, which occurs when
all the symbols of the input have been read. Connector 210 is the
continuation of block 310 and enters in the decision block 120 to process
the next symbol from the input.

In block 220, j is initialized to the length of the current path so as to
process that path from the root of the tree towards the leaf associated
with x[k]. Block 230 involves a decision on whether or not each node
in the path has been processed. The decision block 240 compares the
estimated probability of '0' in the output, $\hat{f}$, to the probability of
'0' requested by the user, $f^*$, so as to decide on the branch assignment
strategy to be applied. The branch "Yes" leads to block 250, which tests
if the path goes to the left (true) or to the right (false). The branch
"Yes" of block 250 goes to block 280 in which a '0' is sent to the output,
and the corresponding counter, $c_0$, is thus increased. Block 280 is also
reached after deciding "No" in block 260, which implies that
$\hat{f} > f^*$, and hence the branch assignment 1-0 has been used. The
branch "No" of block 250, where the branch assignment 0-1 is applied and
the path goes to the right, leads to block 270 in which a '1' is sent to
the output. Block 270 is also reached by the branch "Yes" of block 260, in
which the 1-0 branch assignment is used and the path goes to the left.

In block 290, the estimated probability of '0' in the output is updated
using the current counter of zeros and the current value of n, and the
counter of bits in the output, n, is incremented. Block 300 decrements
the counter of edges in the path, and goes to the decision block 230,
which decides "No" when the node associated with x[k] is reached. In this
case, block 310 is reached, where the counter of symbols in the input,
k, is increased and the process continues with the decision block 120
which ends the looping (decides "No") when all the symbols of the input
have been processed, reaching the Input/Output block 320 in which the
encoded sequence, Y = y[1] ... y[R], is stored. Block 330 returns the
control to the "upper-level" process which invoked Process D_S_H_E_{m,2},
and the process D_S_H_E_{m,2} terminates.

Process D_S_H_E_{m,2}
Input: The Huffman Tree, T. The source sequence, X. The requested
probability of 0, $f^*$.
Output: The output sequence, Y.
Assumption: It is assumed that there is a hashing function which
locates the position of the input alphabet symbols as leaves in T.
Method:
    $c_0(0) \leftarrow 0$; $n \leftarrow 1$; $\hat{f}(0) \leftarrow 0$
    for $k \leftarrow 1$ to $M$ do  // For all the symbols of the input sequence
        Find the path for x[k]
        $q \leftarrow$ root(T)
        while q is not a leaf do
            if $\hat{f}(n-1) \le f^*$ then  // Assignment 0-1
                if path is "left" then
                    $y[n] \leftarrow 0$; $c_0(n) \leftarrow c_0(n-1) + 1$; $q \leftarrow$ left(q)
                else
                    $y[n] \leftarrow 1$; $c_0(n) \leftarrow c_0(n-1)$; $q \leftarrow$ right(q)
                endif
            else  // Assignment 1-0
                if path is "left" then
                    $y[n] \leftarrow 1$; $c_0(n) \leftarrow c_0(n-1)$; $q \leftarrow$ left(q)
                else
                    $y[n] \leftarrow 0$; $c_0(n) \leftarrow c_0(n-1) + 1$; $q \leftarrow$ right(q)
                endif
            endif
            $\hat{f}(n) \leftarrow \frac{c_0(n)}{n}$
            $n \leftarrow n + 1$
        endwhile
    endfor
End Process D_S_H_E_{m,2}
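
A minimal runnable sketch of Process D_S_H_E_{m,2} is given here for
illustration only. It assumes that the Huffman tree of Figure 1 has
the shape implied by the code words and paths used in the examples of
this document (b and the subtree holding c and d on the left, a on the
right). The nested-tuple tree representation and the names TREE,
leaf_paths and encode are hypothetical. A dictionary of root-to-leaf
paths stands in for the hashing function.

```python
# Assumed shape of the Huffman tree of Figure 1: internal nodes are
# (left, right) tuples and leaves are one-character symbols.
TREE = (("b", ("c", "d")), "a")


def leaf_paths(node, prefix=""):
    """Map each symbol to its root-to-leaf path, e.g. 'c' -> 'LRL'."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    paths = leaf_paths(left, prefix + "L")
    paths.update(leaf_paths(right, prefix + "R"))
    return paths


def encode(source, tree, f_star=0.5):
    paths = leaf_paths(tree)            # stands in for the hashing function
    out, zeros = [], 0
    for symbol in source:
        for step in paths[symbol]:
            # Estimate of the probability of 0 over the output so far.
            f_hat = zeros / len(out) if out else 0.0
            if f_hat <= f_star:         # assignment 0-1: left -> 0, right -> 1
                bit = 0 if step == "L" else 1
            else:                       # assignment 1-0: left -> 1, right -> 0
                bit = 1 if step == "L" else 0
            zeros += (bit == 0)
            out.append(bit)
    return "".join(map(str, out))


print(encode("bacba", TREE))   # 011010011
```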
Rationale for the Encoding Process

In the traditional Huffman encoding, after constructing the Huffman
tree in a bottom-up manner, an encoding scheme is generated which is
used to replace each symbol of the source sequence by its corresponding
code word. A labeling procedure is performed at each internal node,
by assigning a code alphabet symbol to each branch. Different labeling
procedures lead to different encoding schemes, and hence to different
probabilities of the output symbols. In fact, for a source alphabet of m
symbols there are $2^{m-1}$ such encoding schemes. It is easy to see (as
from the following example) how the labeling scheme can generate
different probabilities in the output.

Example 1. Consider $\mathcal{S} = \{a, b, c, d\}$,
$\mathcal{P} = \{0.4, 0.3, 0.2, 0.1\}$, $\mathcal{A} = \{0, 1\}$, and the
encoding schemes $\phi_A : \mathcal{S} \to C_A = \{1, 00, 010, 011\}$ and
$\phi_B : \mathcal{S} \to C_B = \{0, 10, 110, 111\}$, generated by using
different labeling strategies on the Huffman tree depicted in Figure 1.
The superior encoding scheme between $\phi_A$ and $\phi_B$, such that
$\mathcal{F}^* = \{0.5, 0.5\}$, is sought.

First, the average code word length in bits per symbol is calculated
(see Hankerson et al. (Introduction to Information Theory and Data
Compression, CRC Press, (1998)) for the exact formula for computing
it) for both encoding schemes:
$\bar{\ell}_A = \bar{\ell}_B = 1(0.4) + 2(0.3) + 3(0.2) + 3(0.1) = 1.9$.

Second, the distances for $\phi_A$ and $\phi_B$ from the optimal
distribution, $d_A(\mathcal{F}_A, \mathcal{F}^*)$ and
$d_B(\mathcal{F}_B, \mathcal{F}^*)$ respectively, are calculated by using
(15) where $\Theta = 2$, as follows:

$f_{0_A} = \frac{0(0.4) + 2(0.3) + 2(0.2) + 1(0.1)}{1.9} \approx 0.57895$,
$f_{1_A} = \frac{1(0.4) + 0(0.3) + 1(0.2) + 2(0.1)}{1.9} \approx 0.42105$,

$d_A(\mathcal{F}_A, \mathcal{F}^*) = |0.57895 - 0.5|^2 + |0.42105 - 0.5|^2 = 0.012466$,

$f_{0_B} = \frac{1(0.4) + 1(0.3) + 1(0.2) + 0(0.1)}{1.9} \approx 0.47368$,
$f_{1_B} = \frac{0(0.4) + 1(0.3) + 2(0.2) + 3(0.1)}{1.9} \approx 0.52632$, and

$d_B(\mathcal{F}_B, \mathcal{F}^*) = |0.47368 - 0.5|^2 + |0.52632 - 0.5|^2 = 0.001385$.

Third, observe that $C_A$ and $C_B$ are prefix, $\bar{\ell}_A$ and
$\bar{\ell}_B$ are minimal, and $d_B < d_A$. Therefore, $\phi_B$ is better
than $\phi_A$ for $\mathcal{F}^* = \{0.5, 0.5\}$. □
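
The arithmetic of Example 1 can be reproduced with the short sketch
below. The function names zero_probability and distance are
hypothetical, and the distance is taken in the form used in the
calculation above (the sum of the absolute differences raised to the
power 2), which is an assumption about the formula referred to as (15).

```python
def zero_probability(code, probs):
    """Probability of 0 in the output of a fixed code (symbol -> code word)."""
    avg_len = sum(probs[s] * len(w) for s, w in code.items())
    avg_zeros = sum(probs[s] * w.count("0") for s, w in code.items())
    return avg_zeros / avg_len


def distance(f0, f_star=0.5, theta=2):
    """Distance between {f0, 1 - f0} and the requested {f*, 1 - f*}."""
    return abs(f0 - f_star) ** theta + abs((1 - f0) - (1 - f_star)) ** theta


PROBS = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
PHI_A = {"a": "1", "b": "00", "c": "010", "d": "011"}
PHI_B = {"a": "0", "b": "10", "c": "110", "d": "111"}

for name, code in (("A", PHI_A), ("B", PHI_B)):
    f0 = zero_probability(code, PROBS)
    print(name, round(f0, 5), distance(f0))
```

The printed distances agree with the figures of Example 1 up to the
rounding used there.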

In order to solve the DODE Problem, a "brute force" technique is
proposed in Hankerson et al. (Introduction to Information Theory and Data
Compression, CRC Press, (1998)), which searches over the $2^{m-1}$
different encoding schemes and chooses the scheme that achieves the
minimal distance. It can be shown that even with this brute force
technique, it is not always possible to obtain the requested output
probability of 0.5. The following example clarifies this.

Example 2. Consider $\mathcal{S} = \{a, b, c, d\}$ and
$\mathcal{A} = \{0, 1\}$. Suppose that the requested probability of 0 in
the output is $f^* = 0.5$. There are $2^{m-1} = 8$ possible encoding
schemes whose code word lengths are {1, 2, 3, 3} (those generated from
the Huffman tree depicted in Figure 1 obtained by using different
labeling strategies).

The eight possible encoding schemes, their probabilities of 0 in the
output, and their distances, as defined in (14) where $\Theta = 2$, are
given in Table 1.

  j   Encoding scheme       Prob. of 0   Distance
  1   {0, 10, 110, 111}     0.47368      0.001385
  2   {0, 10, 111, 110}     0.42105      0.012466
  3   {0, 11, 100, 101}     0.47368      0.001385
  4   {0, 11, 101, 100}     0.42105      0.012466
  5   {1, 00, 010, 011}     0.57895      0.012466
  6   {1, 00, 011, 010}     0.52632      0.001385
  7   {1, 01, 000, 001}     0.57895      0.012466
  8   {1, 01, 001, 000}     0.52632      0.001385

Table 1: All possible encoding schemes for the code word lengths
{1, 2, 3, 3}, where $\mathcal{S} = \{a, b, c, d\}$, $\mathcal{A} = \{0, 1\}$,
and $\mathcal{P} = [0.4, 0.3, 0.2, 0.1]$.

Observe that none of these encodings yields the optimal value of
$f^* = 0.5$. Indeed, the smallest distance is 0.001385, obtained when
using, for example, $\phi_1$, in which the probability of 0 in the output
is 0.47368. □
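
The exhaustive search of Example 2 can be sketched as follows, again
for illustration only. The tree shape assumed for Figure 1 and the
names TREE, build_code and all_schemes are hypothetical. Each of the
three internal nodes is labeled either 0-1 or 1-0, which gives the
eight schemes.

```python
from itertools import product

# Assumed shape of the Huffman tree of Figure 1 (tuples are internal nodes).
TREE = (("b", ("c", "d")), "a")
PROBS = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}


def count_internal(node):
    if isinstance(node, str):
        return 0
    return 1 + sum(count_internal(child) for child in node)


def build_code(node, flips, prefix=""):
    """Build the code words; flips yields one bool per internal node (preorder)."""
    if isinstance(node, str):
        return {node: prefix}
    left_bit, right_bit = ("1", "0") if next(flips) else ("0", "1")
    code = build_code(node[0], flips, prefix + left_bit)
    code.update(build_code(node[1], flips, prefix + right_bit))
    return code


def all_schemes(tree):
    for flips in product((False, True), repeat=count_internal(tree)):
        yield build_code(tree, iter(flips))


def zero_probability(code, probs):
    avg_len = sum(probs[s] * len(w) for s, w in code.items())
    return sum(probs[s] * w.count("0") for s, w in code.items()) / avg_len


for code in all_schemes(TREE):
    f0 = zero_probability(code, PROBS)
    print([code[s] for s in "abcd"], round(f0, 5))
```

None of the eight values printed equals 0.5, in agreement with Table 1.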

As seen in Example 2, it is impossible to find the optimal requested
probability of 0 in the output even after exhaustively searching over all
possible encoding schemes. Rather than using a fixed labeling scheme
for encoding the entire input sequence, D_S_H_E_{m,2} adopts a different
labeling scheme each time a bit is sent to the output. These labeling
schemes are chosen based on the structure of the Huffman tree and the
distribution of the output sequence encoded so far.

Basically, D_S_H_E_{m,2} works by taking advantage of the sibling
property of the Huffman tree (see Gallager (Variations on a Theme by
Huffman, IEEE Transactions on Information Theory, 24(6):668-674, (1978))).
From this property, it can be observed that the weight of the left child
is always greater than or equal to that of the right one. Consider a
Huffman tree, T. By keeping track of the number of 0's, $c_0(n)$, and the
number of bits already sent to the output, n, the probability of 0 in the
output is estimated as $\hat{f}(n) = \frac{c_0(n)}{n}$, whenever a node of
T is visited. D_S_H_E_{m,2} takes $\hat{f}(n)$ and compares it with $f^*$
as follows: if $\hat{f}(n)$ is smaller than or equal to $f^*$, 0 is favored
by the branch assignment 0-1, otherwise 1 is favored by the branch
assignment 1-0. This is referred to as the D_S_H_E_{m,2} Rule.

Essentially, this implies that the process utilizes a composite Huffman
tree obtained by adaptively blending the exponential number of Huffman
trees that would have resulted from a given fixed assignment. This
blending, as mentioned, is adaptive and requires no overhead. The
blending function is not explicit, but implicit, and is done in a manner
so as to force $\hat{f}(n)$ to converge to the fixed point, $f^*$.

D_S_H_E_{m,2} also maintains a pointer, q, that is used to trace a path
from the root to the leaf in which the current symbol is located in T. This
symbol is then encoded by using the D_S_H_E_{m,2} Rule. This division of
path tracing and encoding is done to maintain consistency between the
Encoder and the Decoder. A small example will help to clarify this
procedure.

Example 3. Consider the source alphabet $\mathcal{S} = \{a, b, c, d\}$
with respective probabilities of occurrence
$\mathcal{P} = \{0.4, 0.3, 0.2, 0.1\}$ and the code alphabet
$\mathcal{A} = \{0, 1\}$. To encode the source sequence X = bacba using
D_S_H_E_{m,2}, the Huffman tree (shown in Figure 1) is first constructed
using Huffman's algorithm (see Huffman (A Method for the Construction of
Minimum Redundancy Codes, Proceedings of IRE, 40(9):1098-1101, (1952))).

The encoding achieved by Process D_S_H_E_{m,2} is detailed in Table 2.
First, $c_0(0)$, n, and $\hat{f}(0)$ are all set to 0. The starting branch
assignment is 0-1. For each symbol of X, a path is generated, and for each
element of this path (column 'Go'), using the current assignment, the
output is generated. Afterwards, $c_0(n)$ and $\hat{f}(n)$ are updated. □

Table 2: Encoding of the source sequence X = bacba using D_S_H_E_{m,2}
and the Huffman tree of Figure 1. The resulting output sequence is
Y = 011010011.

From Example 3, the power of D_S_H_E_{m,2} can be observed. The
probability of 0 in the output, $\hat{f}(n)$, almost approaches
$f^* = 0.5$ after generating only 9 bits in the output. Although this
appears to be a "stroke of luck", it is later shown, both theoretically
and empirically, that D_S_H_E_{m,2} follows this behavior relatively
rapidly.

The Decoding Process: D_S_H_D_{m,2}

The decoding process, D_S_H_D_{m,2}, is much more easily implemented.
For each bit received from the output sequence, $\hat{f}(n)$ is compared
with $f^*$. If $\hat{f}(n)$ is smaller than or equal to $f^*$, 0 is favored
by the branch assignment 0-1, otherwise 1 is favored by the branch
assignment 1-0. D_S_H_D_{m,2} also maintains a pointer, q, that is used to
trace a path from the root of T to the leaf where the symbol to be decoded
is located.

The formal decoding procedure is given in Process D_S_H_D_{m,2}, and
pictorially in Figures 5 and 6.

Schematic of D_S_H_D_{m,2}

The process D_S_H_D_{m,2} (see Figures 5 and 6) begins with the
Input/Output block 100, where the encoded sequence, Y = y[1] ... y[R],
the root of the Huffman tree, r, and the requested probability of '0' in
the output, $f^*$, are read. The Huffman tree must be the same as that
used by Process D_S_H_E_{m,2} if the original source sequence, X, is to be
correctly recovered.

In block 110, the estimated probability of '0' in the output, $\hat{f}$,
and the counter of zeros in the output, $c_0$, are initialized to 0. The
number of bits processed from the input sequence, n, and the number of
symbols sent to the output, k, are both initialized to 1. As in
D_S_H_E_{m,2}, other straightforward initializations of these quantities
are also possible, but the Encoder and Decoder must maintain identical
initializations if the original source sequence, X, is to be correctly
recovered. The Process also initializes a pointer to the current node in
the tree, q, which is set to the root of the tree, r.

The decision block 120 constitutes the starting of a looping structure
that ends (branch labeled "No") when all the bits from Y are processed.
The decision block 130 compares the estimated probability of '0' in Y,
$\hat{f}$, with the desired probability, $f^*$, leading to block 140
through the branch "Yes" (branch assignment 0-1) if $\hat{f} \le f^*$. The
decision block 140 leads to block 170 if the current bit is a '0' because
the branch assignment being used is 0-1 and the path must go to the left.
Conversely, when the current bit is a '1', the path must go to the right
(block 160). The branch "No" of block 130 (when $\hat{f} > f^*$) leads to
the decision block 150, which compares the current bit with '0'. If that
bit is a '0', the "Yes" branch goes to block 160, in which the current
pointer, q, is set to its right child. When the current bit is a '1' (the
branch labeled "No" of block 150), the process continues with block 170,
which sets q to its left child, and increments the counter of zeros in Y,
$c_0$.

Connector 180 is the continuation of the process after block 160 or 170
has been processed. Connector 190 is the continuation of the branch
"No" of block 120, which implies that all the bits from Y have been
processed. Connector 200 is the continuation of the branch "Yes" of
block 230, and goes to block 130 to continue processing the current path.
Connector 210 is the continuation of the looping structure that processes
all the bits of Y.

In block 220, the estimated probability of '0' in Y, $\hat{f}$, is updated
using the current counter of zeros, and the counter of bits processed.

The next decision block (block 230) checks whether or not the current
pointer, q, is pointing to a leaf. In the case in which q is pointing to
the leaf associated with the source symbol to be recovered (branch "No"),
the process continues with block 240, in which the corresponding source
symbol, x[k], is recovered, the counter of source symbols in X, k, is
incremented, and the current pointer, q, is moved to the root of the tree.

Block 250 occurs when all the bits from Y have been processed (branch
"No" of block 120). At this point, the original source sequence, X, is
completely recovered, and the Process D_S_H_D_{m,2} terminates in block
260, where the control is returned to the "upper level" process which
invoked it.

Process D_S_H_D_{m,2}
Input: The Huffman Tree, T. The encoded sequence, Y. The
requested probability of 0, $f^*$.
Output: The source sequence, X.
Method:
    $c_0(0) \leftarrow 0$; $n \leftarrow 1$; $\hat{f}(0) \leftarrow 0$
    $q \leftarrow$ root(T); $k \leftarrow 1$
    for $n \leftarrow 1$ to $R$ do  // For all the symbols of the output sequence
        if $\hat{f}(n-1) \le f^*$ then  // Assignment 0-1
            if $y[n] = 0$ then
                $c_0(n) \leftarrow c_0(n-1) + 1$; $q \leftarrow$ left(q)
            else
                $c_0(n) \leftarrow c_0(n-1)$; $q \leftarrow$ right(q)
            endif
        else  // Assignment 1-0
            if $y[n] = 0$ then
                $c_0(n) \leftarrow c_0(n-1) + 1$; $q \leftarrow$ right(q)
            else
                $c_0(n) \leftarrow c_0(n-1)$; $q \leftarrow$ left(q)
            endif
        endif
        if q is a "leaf" then
            $x[k] \leftarrow$ symbol(q); $q \leftarrow$ root(T); $k \leftarrow k + 1$
        endif
        $\hat{f}(n) \leftarrow \frac{c_0(n)}{n}$  // Recalculate the probability of 0 in the output
    endfor
End Process D_S_H_D_{m,2}
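
A minimal runnable sketch of the decoding process is given here for
illustration only. As before, the shape assumed for the Huffman tree
of Figure 1 and the names TREE and decode are hypothetical. The sketch
counts a zero whenever the received bit is 0, so that the Decoder's
estimate stays identical to the Encoder's.

```python
# Assumed shape of the Huffman tree of Figure 1: internal nodes are
# (left, right) tuples and leaves are one-character symbols.
TREE = (("b", ("c", "d")), "a")


def decode(bits, tree, f_star=0.5):
    out, zeros, node = [], 0, tree
    for n, ch in enumerate(bits):       # n bits have been processed so far
        bit = int(ch)
        f_hat = zeros / n if n else 0.0
        if f_hat <= f_star:             # assignment 0-1: 0 means left
            go_left = (bit == 0)
        else:                           # assignment 1-0: 1 means left
            go_left = (bit == 1)
        node = node[0] if go_left else node[1]
        zeros += (bit == 0)
        if isinstance(node, str):       # a leaf: emit the symbol, restart
            out.append(node)
            node = tree
    return "".join(out)


print(decode("011010011", TREE))   # bacba
```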
Example 4. Consider the source alphabet $\mathcal{S} = \{a, b, c, d\}$
with probabilities of occurrence $\mathcal{P} = \{0.4, 0.3, 0.2, 0.1\}$ and
the code alphabet $\mathcal{A} = \{0, 1\}$. The output sequence
Y = 011010011 is decoded using D_S_H_D_{m,2} and the Huffman tree
depicted in Figure 1.

The decoding process is detailed in Table 3. As in D_S_H_E_{m,2},
$c_0(n)$ and $\hat{f}(n)$ are set to 0, and the default branch assignment
is set to be 0-1. For each bit, y[n], read from Y, the current branch
assignment and y[n] determine the direction of the path traced, or more
informally, where child q will 'Go': left (L) or right (R). Subsequently,
$c_0(n)$ and $\hat{f}(n)$ are updated. Whenever a leaf is reached, the
source symbol associated with it is recorded as the decoded output.
Finally, observe that the original input, the source sequence of
Example 3, X = bacba, is completely recovered. □

  n   y[n+1]   c_0(n)   f(n)     Assignment   Go   x[k]
  0   0        0        0.0000   0-1          L
  1   1        1        1.0000   1-0          L    b
  2   1        1        0.5000   0-1          R    a
  3   0        1        0.3333   0-1          L
  4   1        2        0.5000   0-1          R
  5   0        2        0.4000   0-1          L    c
  6   0        3        0.5000   0-1          L
  7   1        4        0.5714   1-0          L    b
  8   1        4        0.5000   0-1          R    a
  9            4        0.4444

Table 3: Decoding of the output sequence Y = 011010011 using Process
D_S_H_D_{m,2} and the Huffman tree of Figure 1. The source sequence of
Example 3 is completely recovered.

Proof of Convergence of D_S_H_E_{m,2}

The fundamental properties of D_S_H_E_{m,2} are crucial in determining
the characteristics of the resulting compression and the encryptions which
are constructed from it. To help derive the properties of D_S_H_E_{m,2},
the properties of the binary input alphabet case are first derived, and
later, the result for the multi-symbol input alphabet case is inductively
derived.

The Binary Alphabet Case

In this section, the convergence of D_S_H_E_{m,2} for a particular case
when the input has a binary source alphabet is proven. In this case,
the Huffman tree has only three nodes - a root and two children. The
encoding process for this case is referred to as D_S_H_E_{2,2}.

Consider the source alphabet, $\mathcal{S} = \{0, 1\}$. The encoding
procedure is detailed in Process D_S_H_E_{2,2} given below. The symbol s,
the most likely symbol of $\mathcal{S}$, is the symbol associated with the
left child of the root, and $\bar{s}$, the complement of s (i.e. $\bar{s}$
is 0 if s is 1 and vice versa), is associated with the right child.

Process D_S_H_E_{2,2}
Input: The source alphabet, $\mathcal{S}$, and the source sequence, X. The
requested probability of 0, $f^*$.
Output: The output sequence, Y.
Method:
    $c_0(0) \leftarrow 0$; $\hat{f}(0) \leftarrow 0$; $s \leftarrow$ the most likely symbol of $\mathcal{S}$
    for $n \leftarrow 1$ to $M$ do  // For all the symbols of the input sequence
        if $\hat{f}(n-1) \le f^*$ then  // Assignment 0-1
            if $x[n] = 0$ then
                $y[n] \leftarrow s$; $c_0(n) \leftarrow c_0(n-1) + \bar{s}$
            else
                $y[n] \leftarrow \bar{s}$; $c_0(n) \leftarrow c_0(n-1) + s$
            endif
        else  // Assignment 1-0
            if $x[n] = 0$ then
                $y[n] \leftarrow \bar{s}$; $c_0(n) \leftarrow c_0(n-1) + s$
            else
                $y[n] \leftarrow s$; $c_0(n) \leftarrow c_0(n-1) + \bar{s}$
            endif
        endif
        $\hat{f}(n) \leftarrow \frac{c_0(n)}{n}$  // Recalculate the probability of 0
    endfor
End Process D_S_H_E_{2,2}
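
The behavior of the binary process can be observed empirically with the
sketch below. It is an illustration only: the function name
encode_binary, the source probability p = 0.8, the sequence length and
the fixed random seed are all assumptions chosen for the demonstration.

```python
import random


def encode_binary(source, f_star=0.5, s=0):
    """Binary sketch: s is the most likely source symbol (the left child)."""
    out, zeros = [], 0
    for x in source:
        f_hat = zeros / len(out) if out else 0.0
        is_left = (x == s)
        if f_hat <= f_star:             # assignment 0-1: left -> 0, right -> 1
            bit = 0 if is_left else 1
        else:                           # assignment 1-0: left -> 1, right -> 0
            bit = 1 if is_left else 0
        zeros += (bit == 0)
        out.append(bit)
    return out


rng = random.Random(1)                  # fixed seed so the run is repeatable
p = 0.8                                 # assumed probability of 0 in the input
source = [0 if rng.random() < p else 1 for _ in range(20000)]
y = encode_binary(source)
print(source.count(0) / len(source))    # close to p
print(y.count(0) / len(y))              # close to the requested 0.5
```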
The Huffman tree for this case is straightforward: it has the root and
two leaf nodes (for 0 and 1). There are two possible labeling strategies
with this tree. The first one consists of labeling the left branch with 0
and the right one with 1 (Figure 6(a)), which leads to the encoding scheme
$\phi_0 : \mathcal{S} \to C = \{0, 1\}$. The alternate labeling strategy
assigns 1 to the left branch and 0 to the right branch (Figure 6(b)),
leading to the encoding scheme $\phi_1 : \mathcal{S} \to C = \{1, 0\}$.
Rather than using a fixed encoding scheme to encode the entire input
sequence, a labeling strategy that chooses the branch assignment (either
0-1 or 1-0) based on the estimated probability of 0 in the output,
$\hat{f}(n)$, is enforced.

In regard to the proof of convergence, it is assumed that the
probability of 0 in the input, p, is greater than or equal to 1 - p, the
probability of 1 occurring in the input, and hence $p \ge 0.5$. This means
that the most likely symbol of $\mathcal{S}$ is 0. The case in which
$p < 0.5$ can be solved by symmetry and is straightforward. It is shown
that any probability requested in the output, $f^*$, is achieved by
D_S_H_E_{2,2}, where $1 - p \le f^* \le p$. The statement of the result is
given in Theorem 1 below. The rigorous proof of this theorem is given in
Appendix B.

Theorem 1 (Convergence of Process D_S_H_E_{2,2}). Consider a
stationary, memoryless source with alphabet $\mathcal{S} = \{0, 1\}$, whose
probabilities are $\mathcal{P} = [p, 1 - p]$, where $p \ge 0.5$, and the
code alphabet, $\mathcal{A} = \{0, 1\}$. If the source sequence
$X = x[1], \ldots, x[n], \ldots$, with $x[i] \in \mathcal{S}$,
$i = 1, \ldots, n, \ldots$, is encoded using the Process D_S_H_E_{2,2} so
as to yield the output sequence $Y = y[1], \ldots, y[n], \ldots$, such that
$y[i] \in \mathcal{A}$, $i = 1, \ldots, n, \ldots$, then

$\lim_{n \to \infty} \Pr[\hat{f}(n) = f^*] = 1$,   (1)

where $f^*$ is the requested probability of 0 in the output
($1 - p \le f^* \le p$), and $\hat{f}(n) = \frac{c_0(n)}{n}$ with $c_0(n)$
being the number of 0's encoded up to time n. □
The result that D_S_H_E_{2,2} guarantees Statistical Perfect Secrecy
follows.

Corollary 1 (Statistical Perfect Secrecy of Process D_S_H_E_{2,2}).
The Process D_S_H_E_{2,2} guarantees Statistical Perfect Secrecy.

Proof. The result of Theorem 1 states that the probability of 0 in the
output sequence asymptotically converges to the requested value $f^*$,
where $1 - p \le f^* \le p$, with probability 1. In particular, this
guarantees convergence whenever $f^*$ is 0.5.

Since the value of $\hat{f}$ is guaranteed to be close to 0.5 after a
finite number of steps, this, in turn, implies that for every input X,
$\Pr[y_{j+1} \ldots y_{j+k}]$, the probability of every output sequence of
length k occurring, will be arbitrarily close to $\frac{1}{2^k}$. The
result follows. □

Equation (57) (in Appendix B) also guarantees optimal and lossless
compression.

Convergence Properties of D_S_H_E_{2,2}

Theorem 1 proves that $\hat{f}(n)$ converges with probability 1 to the
value $f^*$. However, the issue of the transient behavior¹ is not
clarified. It is now shown that $\hat{f}(n)$ converges very quickly to the
final solution. Indeed, if the expected value of $\hat{f}(n)$ is
considered, it can be shown that this quantity converges in one step to
the value 0.5, if $f^*$ is actually set to 0.5.

Theorem 2 (Rate of convergence of Process D_S_H_E_{2,2}). If $f^*$ is
set at 0.5, then $E[\hat{f}(1)] = 0.5$, for D_S_H_E_{2,2}, implying a
one-step convergence in the expected value. □

The effect on the convergence (when using a window, which contains
the last t symbols in the output sequence, to estimate the probability of
0 in the output) is now discussed.

When the probability of 0 in the output is estimated using the entire
sequence of size n, it is well known that $\hat{f}(n)$ converges to $f^*$.
Thus,

$\hat{f}(n) = \frac{c_0(n)}{n} = f^*$.   (2)

Trivially, since the convergence is ensured, after k more steps the
estimated probability of 0 in the output again becomes:

$\hat{f}(n+k) = \frac{c_0(n+k)}{n+k} = f^*$.   (3)

It is straightforward to see that, by estimating the probability of 0
in the output using the last k bits, the modified D_S_H_E_{2,2} would still
converge to the same fixed point $f^*$, since

$\frac{c_0(n+k) - c_0(n)}{k} = \frac{c_0(k)}{k} = f^*$,   (4)

where $c_0(k)$ is the number of 0's in the last k bits of the output
sequence. In other words, this means that instead of using a window of
size k to implement the learning, one can equivalently use the entire
output encoded sequence to estimate $\hat{f}(n)$, as done in
D_S_H_E_{2,2}. The current scheme will converge so as to also yield
point-wise convergence.

¹ The transient behavior, its implications in solving the DODE, and how
it can be eliminated in any given encryption are re-visited later.

The Multi-Symbol Alphabet Case

The convergence of D_S_H_E_{m,2}, where $m > 2$, is now discussed. For
this case, it can be shown that the maximum probability of 0 attainable
in the output is $f_{max}$ and the minimum is $1 - f_{max}$, where
$f_{max}$ is the probability of 0 in the output produced by an encoding
scheme obtained by labeling all the branches of the Huffman tree using the
branch assignment 0-1. This result is shown by induction on the number of
levels of the Huffman tree. It is shown that D_S_H_E_{m,2} converges to
the fixed point $f^*$ at every level of the tree, using the convergence
result of the binary case. The convergence of D_S_H_E_{m,2} is stated and
proved in Theorem 3 below.

Theorem 3 (Convergence of Process D_S_H_E_{m,2}). Consider a
stationary, memoryless source with alphabet S = {s_1, ..., s_m} whose
probabilities are P = [p_1, ..., p_m], the code alphabet A = {0, 1}, and
a binary Huffman tree, T, constructed using Huffman's algorithm. If the
source sequence X = x[1] ... x[M] is encoded by means of the Process
D_S_H_E_{m,2} and T, to yield the output sequence Y = y[1] ... y[R], then

$$\lim_{n \to \infty} \Pr[f(n) = f^*] = 1, \qquad (5)$$

where f* is the requested probability of 0 in the output
(1 - f_max < f* < f_max), and f(n) = c_0(n)/n with c_0(n) being the
number of 0's encoded up to time n. □

Corollary 2. The Process D_S_H_E_{m,2} guarantees Statistical Perfect
Secrecy. □

Empirical Results

One of the sets of files used to evaluate the performance of the
encoding algorithms presented in this research work was obtained from
the University of Calgary. It has been universally used for many years
as a standard test suite known as the Calgary corpus². The other set of
files was obtained from the University of Canterbury. This benchmark,
known as the Canterbury corpus, was proposed in 1997 as a replacement
for the Calgary corpus test suite (see Arnold et al. (A Corpus for the
Evaluation of Lossless Compression Algorithms, Proceedings of the IEEE
Data Compression Conference, pages 201-210, Los Alamitos, CA, IEEE
Computer Society Press, (1997))). These files are widely used to test
and compare the efficiency of different compression methods. They
represent a wide range of different files and applications including
executable programs, spreadsheets, web pages, pictures, source code,
postscript source, etc.

Process D_S_H_E_{m,2} has been implemented by considering the ASCII
set as the source alphabet, and the results have been compared to the
results obtained by traditional Huffman coding. The latter has been
considered here as the fixed-replacement encoding scheme obtained by
using a binary Huffman tree and the branch assignment 0-1. This method
is referred to as the Traditional Static Huffman (TSH) method. The
requested frequency of 0, f*, was set to be 0.5 and Θ was set to be 2.
The empirical results obtained from testing D_S_H_E_{m,2} and TSH on
files of the Calgary corpus and the Canterbury corpus are given in
Tables 4 and 5, respectively. The first column corresponds to the name
of the original file in the Calgary or Canterbury corpus. The second
column, f_TSH, is the maximum probability of 0 attainable with the
Huffman tree, i.e. the probability of 0 in the output attained by TSH.
The third column, f_DSHE, is the estimated probability of 0 in the
output sequence generated by D_S_H_E_{m,2}. The last column, d(F, F*),
is the distance calculated by using (14), where f_DSHE is the
probability of 0 in the compressed file generated by D_S_H_E_{m,2}.

² Available at ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus/.

The last row of Tables 4 and 5, labeled "Average", is the weighted 
average of the distance. 

From Table 4, observe that D_S_H_E_{m,2} achieves a probability of 0
which is very close to the requested one. Besides, the largest distance
(calculated with Θ = 2) is approximately 3.0E-08, and the weighted
average is less than 7.5E-09.

On the files of the Canterbury corpus, even better results (Table 5) are
obtained. Although the worst case was in the output of compressing the
file grammar.lsp, for which the distance is about 1.0E-07, the weighted
average distance was even smaller than 5.5E-10.




File Name    f_TSH          f_DSHE         d(F, F*)
bib          [illegible]    0.500002577    6.64093E-12
book1        [illegible]    0.500000000    0.00000E+00
book2        [illegible]    [illegible]    [illegible]
geo          [illegible]    0.499995693    1.85502E-11
news         [illegible]    0.500020293    4.11806E-10
obj1         [illegible]    0.500007788    6.06529E-11
obj2         0.528048693    0.500083078    6.90195E-09
paper1       0.536035267    0.500030866    9.52710E-10
progc        0.534078465    0.500057756    3.33576E-09
progl        0.531706581    0.499963697    1.31791E-09
progp        0.535076526    0.499992123    6.20471E-11
trans        0.532803448    0.499999065    8.74225E-13
Average                                    7.44546E-09

Table 4: Empirical results obtained after executing D_S_H_E_{m,2} and TSH
on files of the Calgary corpus, where f* = 0.5.

Graphical Analysis 

The performance of D_S_H_E_{m,2} has also been plotted graphically. The
graph for the distance on the file bib of the Calgary corpus is depicted
in Figure 7. The x-axis represents the number of bits generated in the
output sequence and the y-axis represents the average distance between
the probabilities requested in the output and the probability of 0 in the
output sequence.

File Name       f_TSH          f_DSHE         d(F, F*)
alice29.txt     [illegible]    [illegible]    [illegible]
asyoulik.txt    [illegible]    [illegible]    [illegible]
cp.html         [illegible]    0.500000000    0.00000E+00
fields.c        [illegible]    [illegible]    [illegible]
grammar.lsp     [illegible]    [illegible]    [illegible]
kennedy.xls     0.535706989    0.500000278    7.70618E-14
lcet10.txt      0.535809446    0.499989274    1.15043E-10
plrabn12.txt    0.542373081    0.500000454    2.05753E-13
ptt5            0.737834157    0.500000587    3.44100E-13
xargs.1         0.532936146    0.499879883    1.44281E-08
Average                                       5.42759E-10

Table 5: Empirical results of D_S_H_E_{m,2} and TSH tested on files of
the Canterbury corpus, where f* = 0.5.

The average distance mentioned above is calculated as follows. For
each bit generated, the distance, d(f, f*), is calculated by using (14),
where the probability of 0 requested in the output is f* = 0.5, and f
is estimated as the number of 0's, c_0(n), divided by the number of bits
already sent to the output. In order to calculate the average distance,
the output was divided into 500 groups of g_i bits and the average
distance is calculated by adding d(f, f*) over each bit generated in the
group, and dividing it by g_i.
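The group-averaging procedure just described can be sketched in a few lines. Equation (14) is not reproduced in this section, so the sketch assumes the distance |f - f*|^Θ with Θ = 2, which agrees with the tabulated values (for bib, 0.000002577² ≈ 6.64E-12). The function name and the way the group boundaries are chosen are illustrative assumptions.

```python
def group_average_distances(bits, f_star=0.5, groups=500, theta=2):
    """Average the per-bit distance over consecutive groups of output bits.

    Assumes d(f, f*) = |f - f*| ** theta; the exact form of (14) is not
    shown in this section.
    """
    zeros = 0
    distances = []
    for n, bit in enumerate(bits, start=1):
        zeros += bit == 0
        f = zeros / n                          # running estimate c_0(n) / n
        distances.append(abs(f - f_star) ** theta)

    total = len(distances)
    averages = []
    for i in range(groups):
        lo, hi = i * total // groups, (i + 1) * total // groups
        if hi > lo:                            # skip empty groups on short inputs
            averages.append(sum(distances[lo:hi]) / (hi - lo))
    return averages
```

Plotting the returned list against the group index gives a curve of the kind described for Figure 7.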

From the figure it can be seen that the distance drops quickly towards 
0, and remains arbitrarily close to 0. Such convergence is typical for all 
the files from both the Calgary corpus and Canterbury corpus. 
D_A_H_E_{m,2}: The Deterministic Embodiment using
Adaptive Huffman Coding

The Encoding Process 

The technique for solving DODC by using Huffman's adaptive coding
scheme is straightforward. Again, the design takes advantage of the
sibling property of the tree at every instant to decide on the labeling
strategy used at that time. Apart from this, the process has to
incorporate the learning stage in which the probabilities of the source
symbols are estimated. This is a fairly standard method and well
documented in the literature (see Hankerson et al. (Introduction to
Information Theory and Data Compression, CRC Press, (1998)), Gallager
(Variations on a Theme by Huffman, IEEE Transactions on Information
Theory, 24(6):668-674, (1978)), Knuth (Dynamic Huffman Coding, Journal
of Algorithms, 6:163-180, (1985)), and Vitter (Design and Analysis of
Dynamic Huffman Codes, Journal of the ACM, 34(4):825-845, (1987))).

The encoding procedure for the distribution optimizing algorithm using
the Huffman coding in an adaptive manner is detailed in Process
D_A_H_E_{m,2} given below, and pictorially in Figures 9 and 10.

Schematic of D_A_H_E_{m,2}

The schematic chart of Process D_A_H_E_{m,2} is explained here. In
block 100, the source sequence to be encoded, X = x[1] ... x[M], and
the requested probability of '0' in the output, f*, are read from the
input. In block 110, a Huffman tree is constructed following the
procedure presented in Huffman (A Method for the Construction of Minimum
Redundancy Codes, Proceedings of IRE, 40(9), pp. 1098-1101, (1952)), by
assuming a suitable probability distribution for the source symbols. This
procedure returns the Huffman tree by means of a pointer to the root,
r, and a hashing function, h(.), that is used later to locate the node
associated with the symbol being processed from the input.

In block 120, the estimated probability of '0' in the output, f, and
the counter of zeros in the output, c_0, are set to 0. The corresponding
counters for the number of bits sent to the output, n, and for the number
of symbols processed from the input, k, are set to 1. As in D_S_H_E_{m,2},
other straightforward initializations of these quantities are also
possible, but the Encoder and Decoder must maintain identical
initializations if the original source sequence, X, is to be correctly
recovered.

The decision block 130 constitutes the starting of a looping structure
that processes the M symbols from the input. The branch "Yes" leads
to block 140, which sets the current pointer, q, to the node associated
with the current symbol being processed, x[k], by invoking the hashing
function h(.). The length of the path from the root to the node
associated with x[k], ℓ, is set to 0, and a temporary pointer, q_1, used
later when updating the Huffman tree, is set to q so as to store the
pointer to the node associated with x[k].

The decision block 150 constitutes the starting of a looping structure
that processes all the edges in the path. When the current pointer, q, is
not the root, the process continues by tracing the path towards r (branch
"Yes"), going to block 160, which increments the length of the path, ℓ.
The next block in the process is the decision block 170, which checks if
the path comes from the left child (branch "Yes" where the path is set to
true in block 180), or from the right child, where the path is set to
false in block 190. The current pointer, q, is then moved to its parent
(block 195). The branch "No" of block 150 indicates that the root of the
tree has been reached, and the branch assignment process starts in block
230 (reached through Connector 200), which sets a counter of bits to be
sent to the output, j, to the length of the path, ℓ.

Connector 210 is the continuation of the branch "No" of the decision
block 130, and indicates that all the symbols coming from the input
have been processed. Connector 220 is the continuation of the looping
structure that processes all the symbols of the input.

The decision block 240 constitutes the starting of a looping structure
that processes all the edges in the current path. When j > 0 (branch
"Yes"), the process continues by performing a labeling of the current
edge (block 250). If the estimated probability of '0' in the output, f,
is less than or equal to the required probability, f*, the branch
assignment 0-1 is used, and hence when the path leads to the left (true
or branch "Yes" of block 260), a '0' is sent to the output and the
counter of zeros is incremented (block 290). Conversely, when the path
goes to the right (false or branch "No" of block 260), a '1' is sent to
the output (block 280). The branch "No" of block 250 (f > f*) indicates
that the branch assignment 1-0 is used, leading to the decision block
270, in which the branch "Yes" leads to block 280 (a '1' is sent to the
output), and the branch "No" leads to block 290 (a '0' is sent to the
output and the counter of zeros, c_0, is incremented).

In block 300, the estimated probability of '0' in the output, f, is
updated using the current counter of zeros and the current counter of
bits. The counter of edges in the path, j, is decremented (block 310),
since the path is now being traced from the root to q_1.

When all the edges in the path are processed (branch "No" of block
240), the counter of symbols processed from the input, k, is increased
(block 320), and the Huffman tree is updated by invoking the procedure
described in Knuth (Dynamic Huffman Coding, Journal of Algorithms,
Vol. 6, pp. 163-180, (1985)), or Vitter (Design and Analysis of Dynamic
Huffman Codes, Journal of the ACM, 34(4):825-845, (1987)), returning
the updated hashing function, h(.), and the new root of the tree, r, if
changed.

When all symbols from the input have been processed (branch "No"
of block 130) the encoded sequence, Y, is stored (Input/Output block
340), the Process D_A_H_E_{m,2} terminates, and the control is returned
to the "upper-level" process (block 350) that invoked it.


Process D_A_H_E_{m,2}
Input: The source alphabet, S. The source sequence, X. The
requested probability of 0, f*.
Output: The output sequence, Y.
Assumption: It is assumed that there is a hashing function which
locates the position of the input alphabet symbols as leaves in T. It is
also assumed that the Process has at its disposal an algorithm (see [5],
[16]) to update the Huffman tree adaptively as the source symbols come.
Method:
    Construct a Huffman tree, T, assuming any suitable distribution
    for the symbols of S. In this instantiation, it is assumed that the
    symbols are initially equally likely.
    c_0(0) ← 0; n ← 1; f(1) ← 0
    for k ← 1 to M do    // For all the symbols of the input sequence
        Find the path for x[k]; q ← root(T)
        while q is not a leaf do
            if f(n) ≤ f* then    // Assignment 0-1
                if path is "left" then
                    y[n] ← 0; c_0(n) ← c_0(n - 1) + 1; q ← left(q)
                else
                    y[n] ← 1; q ← right(q)
                endif
            else    // Assignment 1-0
                if path is "left" then
                    y[n] ← 1; q ← left(q)
                else
                    y[n] ← 0; c_0(n) ← c_0(n - 1) + 1; q ← right(q)
                endif
            endif
            f(n) ← c_0(n)/n; n ← n + 1
        endwhile
        Update T by using the adaptive Huffman approach [5], [16].
    endfor
End Process D_A_H_E_{m,2}
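A minimal runnable sketch of the encoding loop above is given here for illustration. It makes two simplifying assumptions that are not part of the Process: the tree is rebuilt from symbol counts after every symbol instead of being updated incrementally by the procedures of Knuth or Vitter, and the heavier subtree is always placed on the left so that the 0-1 assignment favours zeros. The names `build_tree`, `leaf_paths` and `dahe_encode` are hypothetical.

```python
import heapq
from itertools import count

def build_tree(counts):
    """Huffman tree from counts: leaves are symbols, internal nodes are
    (left, right) tuples whose left subtree is at least as heavy."""
    tie = count()                      # unique tie-breaker; nodes are never compared
    heap = [(w, next(tie), s) for s, w in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, light = heapq.heappop(heap)
        w2, _, heavy = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (heavy, light)))
    return heap[0][2]

def leaf_paths(node, prefix=()):
    """Map each symbol to its root-to-leaf path (True = left edge)."""
    if not isinstance(node, tuple):
        return {node: prefix}
    paths = leaf_paths(node[0], prefix + (True,))
    paths.update(leaf_paths(node[1], prefix + (False,)))
    return paths

def dahe_encode(source, alphabet, f_star=0.5):
    """Encode `source` (symbols must not be tuples) into a list of bits."""
    counts = {s: 1 for s in alphabet}  # symbols initially equally likely
    zeros, n, f, out = 0, 0, 0.0, []
    for sym in source:
        for goes_left in leaf_paths(build_tree(counts))[sym]:
            zero_on_left = f <= f_star  # assignment 0-1 if True, else 1-0
            bit = 0 if goes_left == zero_on_left else 1
            out.append(bit)
            zeros += bit == 0
            n += 1
            f = zeros / n               # estimate of the probability of 0
        counts[sym] += 1                # stand-in for the adaptive tree update
    return out
```

Rebuilding the tree at every step is far slower than an incremental update and is used here only to keep the sketch short; any deterministic construction works as long as the decoder uses the same one.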

The Decoding Process 

The decoding process for the adaptive Huffman coding is also
straightforward. It only requires the encoded sequence, Y, and some
conventions to initialize the Huffman tree and the associated
parameters. This procedure is formalized in Process D_A_H_D_{m,2}, and
pictorially in Figures 11 and 12. Observe that, unlike in the static
Huffman coding, the Decoder does not require that any extra information
is stored in the encoded sequence. It proceeds "from scratch" based on
prior established conventions.

Schematic of D_A_H_D_{m,2}

The Process D_A_H_D_{m,2}, depicted in Figures 11 and 12, begins with
the Input/Output block 100, where the encoded sequence,
Y = y[1] ... y[R], and the requested probability of '0', f*, are read.
In block 110, the procedure that constructs the Huffman tree is invoked
using the same initial probability distributions for the source symbols
as those used in the Process D_A_H_E_{m,2} so as to recover the original
source sequence, X. Again, the tree is referenced by a pointer to its
root, r.

In block 120, the estimated probability of '0' in Y, f, and the counter
of zeros in Y, c_0, are both set to 0. The counter of bits processed, n,
and the counter of symbols sent to the output, k, are set to 1, and the
current pointer, q, is set to the root of the tree, r. As in
D_A_H_E_{m,2}, other straightforward initializations of these quantities
are also possible, but the Encoder and Decoder must maintain identical
initializations if the original source sequence, X, is to be correctly
recovered.

The decision block 130 constitutes the starting of a looping structure
that processes all the bits from Y. The next block is the decision block
140, which compares the estimated probability of '0' in Y, f, to the
requested probability, f*. If f ≤ f*, the branch assignment 0-1 is used,
and hence in block 150, when a '0' comes from Y, the branch "Yes"
moves q to its left child (block 180), and when y[n] is a '1', the
current pointer, q, goes to its right child (block 170). The branch "No"
of the decision block 140 (f > f*) indicates that the branch assignment
1-0 is used, leading to the decision block 160. When the current bit is
a '0', the current pointer is set to its right child (block 170), and
when y[n] is a '1', the current pointer, q, is set to its left child,
and the counter of zeros in Y, c_0, is increased (block 180).

The process continues through connector 190, leading then to block
230 in which the estimated probability of '0' in Y, f, is updated using
the current counter of zeros, c_0, and the counter of bits processed, n.
The process continues by performing the decision block 240, where the
current pointer, q, is tested. If q is not a leaf (i.e. its left child is
not nil), the process continues with the next bit (the next edge in the
path) going to block 140 through connector 210. Otherwise the
corresponding source symbol, x[k], is recovered, and the counter of
source symbols, k, is increased (block 250). The process continues with
block 260, where the tree is updated by invoking the same updating
procedure as that of the D_A_H_E_{m,2} so that the same tree is
maintained by both the Encoder and Decoder. In block 270, the current
pointer, q, is set to the root r, and the process continues with the
decision block 130 through connector 220 so that the next source symbol
is processed.

The branch "No" of the decision block 130 indicates that all the bits
of Y have been processed, going (through connector 200) to the
Input/Output block 280 in which the original source sequence, X, is
stored. The Process D_A_H_D_{m,2} terminates in block 290.
Process D_A_H_D_{m,2}
Input: The source alphabet, S. The encoded sequence, Y. The
requested probability of 0, f*.
Output: The source sequence, X.
Assumption: It is assumed that there is a hashing function which
locates the position of the input alphabet symbols as leaves in T. It is
also assumed that the Process has at its disposal an algorithm (see [5],
[16]) to update the Huffman tree adaptively as the source symbols come.
Method:
    Construct a Huffman tree, T, assuming any suitable distribution
    for the symbols of S. In this instantiation, it is assumed that the
    symbols are initially equally likely.
    c_0(0) ← 0; n ← 1; f(1) ← 0
    q ← root(T); k ← 1
    for n ← 1 to R do    // For all the symbols of the output sequence
        if f(n) ≤ f* then    // Assignment 0-1
            if y[n] = 0 then
                c_0(n) ← c_0(n - 1) + 1; q ← left(q)
            else
                q ← right(q)
            endif
        else    // Assignment 1-0
            if y[n] = 0 then
                q ← right(q)
            else
                c_0(n) ← c_0(n - 1) + 1; q ← left(q)
            endif
        endif
        if q is a "leaf" then
            x[k] ← symbol(q)
            Update T by using the adaptive Huffman approach [5], [16].
            q ← root(T); k ← k + 1
        endif
        f(n) ← c_0(n)/n    // Recalculate the probability of 0 in the output
    endfor
End Process D_A_H_D_{m,2}
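The decoding loop can be sketched in the same spirit. As before, this is an illustration under assumptions rather than the Process itself: the tree is rebuilt from counts after each recovered symbol (instead of the incremental update of [5], [16]), the heavier subtree is placed on the left, and the names `build_tree` and `dahe_decode` are hypothetical. The sketch counts a zero whenever the received bit is 0, so that its estimate of f follows the same trajectory as the one maintained while encoding.

```python
import heapq
from itertools import count

def build_tree(counts):
    """Huffman tree from counts: leaves are symbols, internal nodes are
    (left, right) tuples whose left subtree is at least as heavy."""
    tie = count()                      # unique tie-breaker; nodes are never compared
    heap = [(w, next(tie), s) for s, w in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, light = heapq.heappop(heap)
        w2, _, heavy = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (heavy, light)))
    return heap[0][2]

def dahe_decode(bits, alphabet, f_star=0.5):
    """Recover the source symbols from a list of bits."""
    counts = {s: 1 for s in alphabet}  # same initial convention as the encoder
    zeros, n, f, out = 0, 0, 0.0, []
    node = build_tree(counts)
    for bit in bits:
        zero_on_left = f <= f_star     # assignment 0-1 if True, else 1-0
        goes_left = (bit == 0) == zero_on_left
        node = node[0] if goes_left else node[1]
        zeros += bit == 0
        n += 1
        f = zeros / n                  # estimate of the probability of 0
        if not isinstance(node, tuple):     # reached a leaf
            out.append(node)
            counts[node] += 1               # stand-in for the adaptive update
            node = build_tree(counts)       # restart from the new root
    return out
```

Feeding this function the bits produced by an encoder that uses the same tree construction and the same initial counts returns the original sequence; no side information about the source statistics is needed.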

Proof of Convergence 

The solution of DODC using D_A_H_E_{2,2} works with a Huffman tree
which has three nodes: the root and two children. This tree has the
sibling property, hence the decision in the branch assignment rule is
based on the value of f(n) compared to the requested probability of 0 in
the output, f*. The proof of convergence for this particular case is
given below in Theorem 4.

Theorem 4 (Convergence of Process D_A_H_E_{2,2}). Consider a
memoryless source with alphabet S = {0, 1}, whose probabilities are
P = [p, 1 - p], where p > 0.5, and the code alphabet, A = {0, 1}. If the
source sequence X = x[1], ..., x[n], ..., with x[i] ∈ S, i = 1, ..., n,
..., is encoded using the Process D_A_H_E_{2,2} so as to yield the output
sequence Y = y[1], ..., y[n], ..., such that y[i] ∈ A, i = 1, ..., n,
..., then

$$\lim_{n \to \infty} \Pr[f(n) = f^*] = 1, \qquad (6)$$

where f* is the requested probability of 0 in the output
(1 - p < f* < p), and f(n) = c_0(n)/n with c_0(n) being the number of
0's encoded up to time n. □
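The behaviour stated in the theorem can be observed numerically with a small simulation. This is a sketch under assumptions, not a proof: the memoryless source is imitated with Python's `random.Random`, the adaptive tree is reduced to two symbol counts with the more frequent symbol placed on the left, and the function name `simulate_binary` is hypothetical.

```python
import random

def simulate_binary(p=0.8, f_star=0.5, steps=10_000, seed=1):
    """Return f(n) after `steps` bits for a binary source with Pr[0] = p."""
    rng = random.Random(seed)          # stand-in for the memoryless source
    counts = [1, 1]                    # adaptive counts of source symbols 0 and 1
    zeros, f = 0, 0.0
    for n in range(1, steps + 1):
        sym = 0 if rng.random() < p else 1
        left = 0 if counts[0] >= counts[1] else 1   # more frequent symbol on the left
        goes_left = sym == left
        zero_on_left = f <= f_star                  # assignment 0-1, else 1-0
        bit = 0 if goes_left == zero_on_left else 1
        zeros += bit == 0
        f = zeros / n                               # f(n) = c_0(n) / n
        counts[sym] += 1
    return f
```

For any f* strictly between 1 - p and p, the returned value should lie close to f*, because each assignment pushes the estimate back towards the requested value.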

The Huffman tree maintained in D_A_H_E_{m,2} satisfies the sibling
property, even though its structure may change at every time 'n'.
Therefore, the convergence of D_A_H_E_{m,2} follows from the convergence
of D_A_H_E_{2,2}. The convergence result is stated in Theorem 5, and
proved in Appendix B. Here, the fact that the estimated probability of 0
in the output can vary on the range [1 - f_max, f_max] is utilized. The
final result that D_A_H_E_{m,2} yields Statistical Perfect Secrecy
follows.
Theorem 5 (Convergence of Process D_A_H_E_{m,2}). Consider a
memoryless source with alphabet S = {s_1, ..., s_m} whose probabilities
are P = [p_1, ..., p_m], and the code alphabet A = {0, 1}. If the source
sequence X = x[1] ... x[M] is encoded by means of the Process
D_A_H_E_{m,2}, generating the output sequence Y = y[1] ... y[R], then

$$\lim_{n \to \infty} \Pr[f(n) = f^*] = 1, \qquad (7)$$

where f* is the requested probability of 0 in the output
(1 - f_max < f* < f_max), and f(n) = c_0(n)/n with c_0(n) being the
number of 0's encoded up to time 'n'. □

Corollary 3. The Process D_A_H_E_{m,2} guarantees Statistical Perfect
Secrecy. □

Empirical Results 

In order to demonstrate the performance of D_A_H_E_{m,2}, the latter
has been tested on files of the Calgary corpus. The empirical results
obtained are cataloged in Table 6, for which the requested probability of
0 in the output is f* = 0.5. The column labeled f_TAH corresponds to
the estimated probability of 0 in the output after encoding the files
using the Traditional Adaptive Huffman (TAH) scheme. The column labeled
f_DAHE corresponds to the estimated probability of 0 in the output of
D_A_H_E_{m,2}. Observe that D_A_H_E_{m,2} performs very well on these
files, and even sometimes attains the value 0.5 exactly. This is also
true for the average performance, as can be seen from the table.



File Name    f_TAH          f_DAHE         d(F, F*)
bib          0.534890885    0.500002567    6.59052E-12
book1        0.539382961    0.500000142    2.02778E-14
book2        0.536287453    0.499818087    3.30923E-08
geo          0.526089998    0.499985391    2.13437E-10
news         0.534758020    0.500027867    7.76559E-10
obj1         0.527805217    0.497746566    5.07797E-06
obj2         0.527944809    0.500000000    0.00000E+00
paper1       0.535224321    0.500513626    2.63812E-07
progc        0.535104455    0.500121932    1.48674E-08
progl        0.535598140    0.500010118    1.02366E-10
progp        0.535403385    0.500000000    0.00000E+00
trans        0.533495837    0.500001909    3.64352E-12
Average                                    6.46539E-08

Table 6: Empirical results obtained after running D_A_H_E_{m,2} and TAH
on the files of the Calgary corpus, where f* = 0.5.

D_A_H_E_{m,2} was also tested on files of the Canterbury corpus. The
empirical results are presented in Table 7, where, as usual, the optimal
probability of 0 in the output was set to be f* = 0.5. The power of
D_A_H_E_{m,2} is reflected by the fact that the average distance is less
than 9.0E-10, and that D_A_H_E_{m,2} performed extremely well in all
the files contained in this benchmark suite. Observe that the
performance achieved by D_A_H_E_{m,2} is as good as that of the static
version D_S_H_E_{m,2}. Additionally, it is important to highlight the
fact that the former does not require any overhead (statistical
information of the source) in the encoded data so as to proceed with the
encoding and decoding processes.



File Name       f_TAH          f_DAHE         d(F, F*)
alice29.txt     0.543888521    0.499945970    2.91922E-09
asyoulik.txt    0.538374097    0.500012320    1.51787E-10
cp.html         0.537992949    0.500003807    1.44948E-11
fields.c        0.534540728    0.499878683    1.47179E-08
grammar.lsp     0.537399558    0.499919107    6.54369E-09
kennedy.xls     0.535477484    0.499999865    1.82520E-14
lcet10.txt      0.538592652    0.499947023    2.80657E-09
plrabn12.txt    0.542479709    0.499999531    2.19961E-13
ptt5            0.736702657    0.500000000    0.00000E+00
xargs.1         0.536812041    0.500136005    1.84974E-08
Average                                       8.89609E-10

Table 7: Empirical results of D_A_H_E_{m,2} and TAH tested on the files
of the Canterbury corpus, where f* = 0.5.

In order to provide another perspective about the properties of
D_A_H_E_{m,2}, a graphical display of the value of d(F, F*) obtained by
processing the file bib of the Calgary corpus is included. The plot of
the distance for this file is shown in Figure 12.

Observe that the convergence of the estimated probability of 0 occurs
rapidly, and the distance remains arbitrarily close to 0 as n increases.
From our experience, in general, D_A_H_E_{m,2} converges faster than the
static version D_S_H_E_{m,2}. The results presented here are typical;
analogous results are available for the other files of the Calgary
corpus and for files of the Canterbury corpus.

The Randomized Embodiments using Adaptive Huffman Coding

This section describes a few randomized embodiments for solving the
Distribution Optimizing Data Compression problem. These processes
are obtained by incorporating various stochastic (or randomized) rules
in the decision of the labeling assignment. In general, the various
embodiments operate on the general Oommen-Rueda Tree, but the
embodiments described here specifically utilize a Huffman tree. Also,
although the Huffman tree can be statically or adaptively maintained, to
avoid repetition, the randomized embodiments are described in terms of
the Adaptive Huffman Tree.

The Encoding Process 

The first embodiment described is RV_A_H_E_{m,2}. The underlying tree
structure is the Adaptive Huffman Tree. The Branch Assignment Rule
is randomized, and thus a branch assignment decision is made after
invoking a pseudo-random number. As per the nomenclature mentioned
earlier, it is referred to as RV_A_H_E_{m,2}.

Consider the source alphabet S = {s_1, ..., s_m} and the code alphabet
A = {0, 1}. Consider also an input sequence X = x[1] ... x[M], which
has to be encoded, generating an output sequence Y = y[1] ... y[R]. The
procedure for encoding X into Y by means of a randomized solution to the
DODC, utilizing the Huffman tree in an adaptive manner, is formalized in
Process RV_A_H_E_{m,2} below. The process is also given pictorially in
Figures 14 and 15. The decision on the branch assignment to be used at
time 'n' is based on the current pseudo-random number obtained from the
generator, RNG[next_random_number], assumed to be available for the
system in which the scheme is implemented. In general, any
cryptographically secure pseudo-random number generator can be used.
Schematic of RV_A_H_E_{m,2}

The Process RV_A_H_E_{m,2}, depicted in Figures 14 and 15, begins in
block 100, in which the input sequence, X = x[1] ... x[M], the requested
probability of '0' in the output, f*, and a user-defined seed, β, are
read. The initial seed must be exactly the same seed as used by the
decoding Process RV_A_H_D_{m,2} to correctly recover the plaintext. A
Huffman tree is then constructed in block 110 by following the procedure
presented in Huffman (A Method for the Construction of Minimum
Redundancy Codes, Proceedings of IRE, 40(9), pp. 1098-1101, (1952)), and
assuming a suitable initial probability distribution for the source
symbols. This procedure returns a pointer to the root of the tree, r,
which serves as a reference to the entire tree, and a hashing function,
h(s), which is used later to locate the node associated with s.

In block 120, a pseudo-random number generator (RNG) is initialized
using the user-defined seed, β.

The next step consists of initializing the estimated probability of '0'
in the output, f, and the counter of zeros in the output, c_0, both to 0
(block 130). In this block, the counter of bits sent to the output, n,
and the counter of source symbols processed from the input, k, are set
to 1.

The next decision block (block 140) constitutes the starting point of an
iteration on the number of source symbols coming from the input. When
there are symbols to be processed (the branch "Yes"), the process
continues with block 150, where the current node, q, is set to the
corresponding value that the hashing function returns after being invoked
with the current source symbol as a parameter. The node associated with
x[k], q_1, is set to the current node, q, and the length of the path from
that node to the root, ℓ, is set to 0. The decision block 160 evaluates
if the current pointer, q, has not reached the root of the tree. In this
case, the process continues with block 170, where the length of the path,
ℓ, is increased. The next block in the process is the decision block 180,
which evaluates if the current node, q, is a left child or a right child.
The path is set to true (block 190) in the former case, and to false
(block 200) in the latter case. The process then continues with block
205, where the current pointer, q, is moved to its parent.

The branch "No" of the decision block 160 indicates that the current
node has reached the root, and the process continues with block 240
through connector 210, where the counter of edges in the current path,
j, is initialized to ℓ, the length of the path. The decision block 250
constitutes the starting point of a looping structure that processes all
the nodes in the current path. When there are edges remaining in the path
(j > 0), the n-th random number obtained from RNG is compared to the
estimated probability of '0' in the output, f, (block 260). If
RNG[n] > f, the branch assignment 0-1 is used, continuing the process
with the decision block 270, which leads to the left child if path[ℓ] is
true. In this case, a '0' is sent to the output and the counter of zeros
in the output, c_0, is incremented (block 310). If path[ℓ] is false, the
process continues with block 290, where a '1' is sent to the output. The
branch "No" of the decision block 260 implies that the branch assignment
1-0 is used, continuing the process with the decision block 280. When
path[ℓ] is true, a '1' is sent to the output (block 290), otherwise the
process continues with block 310. The next block in the process is block
320, where the estimated probability of '0' in the output, f, is updated
and the counter of bits sent to the output, n, is increased. The next
block (block 320) decreases the counter of edges in the path, j, and
continues with the next edge in the path (the decision block 250).

When all the edges in the current path have been processed, the
process continues with block 330, where the counter of source symbols
processed, k, is incremented. The next block in the process is block 340
in which the Huffman tree is updated by invoking the procedure introduced
in Knuth (Dynamic Huffman Coding, Journal of Algorithms, Vol. 6, pp.
163-180, (1985)), or Vitter (Design and Analysis of Dynamic Huffman
Codes, Journal of the ACM, 34(4):825-845, (1987)), returning the
updated hash function, h(.), and the new root of the tree, r, if changed.
The process continues with the decision block 140 (through connector
230).

When all the source symbols have been processed (branch "No" of the
decision block 140), the process continues with the Input/Output block
350 (through connector 220), where the encoded sequence, Y, is stored.
The process then terminates in block 360.
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Process RV_A_H_E m , 2 
Input: The source alphabet, S. The source sequence, X. A user-defined
seed, β. It is assumed that we are converging to the value of
f* = 0.5.

Output: The output sequence, Y.

Assumption: It is assumed that there is a hashing function which
locates the position of the input alphabet symbols as leaves in T. It
is also assumed that the Process has at its disposal an algorithm (see
Knuth [5] and Vitter [16]) to update the Huffman tree adaptively as the
source symbols come.

Method:
    Construct a Huffman tree, T, assuming any suitable distribution
    for the symbols of S. In this instantiation, it is assumed that the
    symbols are initially equally likely.
    Initialize a pseudo-random number generator, RNG, using β.
    c_0(0) ← 0; n ← 1; f̂(1) ← 0
    for i ← 1 to M do    // For all the symbols of the input sequence
        Find the path for x[i]
        q ← root(T)
        while q is not a leaf do
            if RNG[next_random_number] > f̂(n) then    // Assignment 0-1
                if path is "left" then
                    y[n] ← 0; c_0(n) ← c_0(n - 1) + 1; q ← left(q)
                else
                    y[n] ← 1; q ← right(q)
                endif
            else    // Assignment 1-0
                if path is "left" then
                    y[n] ← 1; q ← left(q)
                else
                    y[n] ← 0; c_0(n) ← c_0(n - 1) + 1; q ← right(q)
                endif
            endif
            f̂(n) ← c_0(n)/n; n ← n + 1    // Recalculate the probability of 0 in the output
        endwhile
        Update T by using the adaptive Huffman approach [5], [16].
    endfor
End Process RV_A_H_E_{m,2}

Rationale for the Encoding Process 

Process RV_A_H_E_{m,2} works by performing a stochastic rule, based on
a pseudo-random number, a Huffman tree, and the estimated probability
of 0 in the output at time 'n', f̂(n). At time 'n', the next pseudo-random
number, α, obtained from the generator is compared with the current
value of f̂(n). Since α is generated by a uniformly distributed random
variable, the branch assignment 0-1 is chosen with probability 1 - f̂(n),
and hence the branch assignment 1-0 is chosen with probability f̂(n).
Let us denote p_i = w_i / (w_i + w_{i+1}), where w_i is the weight of the
i-th node of the Huffman tree, and w_{i+1} is the weight of its right
sibling. Whenever f̂(n) ≤ 0.5, the branch assignment 0-1, which is then
the more likely choice, implies that the output is to be a 0 with
probability p_i, and a 1 with probability 1 - p_i. Since p_i ≥ 0.5
(because of the sibling property), the number of 0's in the output is
more likely to be increased than the number of 1's. This rule causes
f̂(n) to move towards f* = 0.5. Conversely, when f̂(n) > 0.5, the
branch assignment 1-0 causes f̂(n) to move downwards towards f* = 0.5.
This makes f̂(n) asymptotically converge to the fixed point f* = 0.5 as
n → ∞. This is formally and empirically shown later in this section.
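
The drift towards f* can be made explicit with a short calculation. The following is only a sketch, under one added assumption that the text does not state in this form: at the node being traversed, the path of the source symbol takes the left branch with probability p_i, independently of the random number α.

```latex
\begin{align*}
\Pr\bigl[\,y[n] = 0 \mid \hat{f}(n)\,\bigr]
  &= \underbrace{\bigl(1 - \hat{f}(n)\bigr)\,p_i}_{\text{assignment 0-1, path goes left}}
   + \underbrace{\hat{f}(n)\,\bigl(1 - p_i\bigr)}_{\text{assignment 1-0, path goes right}} \\
  &= \tfrac{1}{2} + \bigl(2p_i - 1\bigr)\bigl(\tfrac{1}{2} - \hat{f}(n)\bigr).
\end{align*}
```

Since p_i ≥ 0.5, the probability of sending a 0 is at least 0.5 whenever f̂(n) ≤ 0.5, and at most 0.5 whenever f̂(n) ≥ 0.5, so the estimate is pushed towards 0.5 from either side.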

It is important to note that Process RV_A_H_E_{m,2} works only when
the requested probability of 0 in the output is f* = 0.5. However,
RV_A_H_E_{m,2} can easily be adapted to work with any value of f*,
whenever 1 - f_max ≤ f* ≤ f_max. This can be achieved by generating random
numbers from a random variable whose mean is the desired f*.
An example will clarify the above branch assignment rule.

Example 5. Suppose that, at time 'n', the decision on the branch
assignment has to be made. Suppose also that α = 0.5972... is the returned
pseudo-random number, and that f̂(n) = 0.3. Both of these values are
used to make the decision on the assignment. Here α > f̂(n), so the
branch assignment 0-1 is selected. These values, the interval
[0,1], and the range corresponding to each branch assignment are
depicted in Figure 15. From these arguments, it is easy to see that since
f̂(n) < f* = 0.5, it is more likely to send a 0 to the output than a 1, thus
increasing f̂(n + 1) from 0.3 towards f*. This shows the way by which
RV_A_H_E_{m,2} works so as to asymptotically converge to f*. □

The Decoding Process 

The decoding process, RV_A_H_D_{m,2}, follows from RV_A_H_E_{m,2}, but
in the reverse manner. As in the decoding process of D_A_H_E_{m,2},
RV_A_H_D_{m,2} works by keeping track of the number of 0's already
read from the encoded sequence. These are used to estimate the probability
of 0, f̂(n), at each time instant. Also, it is assumed that RV_A_H_E_{m,2}
and RV_A_H_D_{m,2} utilize the same sequence of pseudo-random numbers
to ensure that the encoded sequence is correctly decoded. Using this
information and the Huffman tree (which is adaptively maintained), the
stochastic rule is invoked so as to decide which branch assignment is to
be used (either 0-1 or 1-0) to decode the given string. The actual
procedure for the decoding process that uses Huffman coding adaptively is
formalized in Process RV_A_H_D_{m,2} and given pictorially in Figures 17
and 18.

Schematic of RV_A_H_D_{m,2}

The schematic of Process RV_A_H_D_{m,2} given in Figures 17 and 18 is
explained below. The process starts with the Input/Output block 100,
where the encoded sequence, Y = y[1] . . . y[R], the requested probability
of '0' in the output, f*, and a user-defined seed, β, are read. The initial
seed must be exactly the same seed as used by the Process RV_A_H_E_{m,2}
to correctly recover the plaintext.

The next block (block 110) generates a Huffman tree by invoking the
same procedure as that of the Process RV_A_H_E_{m,2} with the same initial
probabilities for the source symbols, returning the root of the tree, r.
The next block in the process is block 120, which initializes the same
pseudo-random number generator (RNG) as that of Process RV_A_H_E_{m,2},
using the user-defined seed β. In block 125, the estimated probability
of '0' in Y, f̂, and the counter of zeros in Y, c_0, are both set to 0. The
counter of bits processed from Y, n, and the counter of source symbols
processed, k, are set to 1. As in the processes described earlier, other
straightforward initializations of these quantities are also possible, but
the Encoder and Decoder must maintain identical initializations if the
original source sequence, X, is to be correctly recovered. In this block,
the pointer to the current node, q, is set to the root r.

The next block of the process is the decision block 130, which
constitutes the starting point of an iteration on the number of bits
processed. The "Yes" branch of this block indicates that more bits have to
be processed, continuing with the decision block 140, in which the
current pseudo-random number from RNG is compared with the estimated
probability of '0' in Y. When RNG[n] > f̂, the branch assignment 0-1 is
being used, and the process continues with the decision block 150, which
tests if the current bit is a '0'. In this case, the process continues with
block 180, where the current pointer, q, is moved to its left child and the
counter of zeros in Y is incremented. Conversely, when the current bit is
a '1', the process continues with block 170, in which the current pointer,
q, is moved to its right child.

The branch "No" of block 140 continues with the decision block 160,
which tests the current bit. If this bit is a '0', the process continues with
block 170; otherwise, it goes to block 180.

The next block in the process is block 230 (reached through connector
190), where the estimated probability of '0' in Y is re-calculated. The
decision block 240 tests if the current pointer has not reached a leaf. In
this case, the process continues with the next edge in the path, going
to the decision block 140 (reached through connector 210). The branch
"No" of the decision block 240 indicates that a leaf is reached, and the
corresponding source symbol, x[k], is recovered (block 240). The counter
of source symbols, k, is thus incremented.

The next block in the process is block 250, in which the Huffman
tree is updated by invoking the same updating procedure as that of the
Process RV_A_H_E_{m,2}. In block 260, the current pointer, q, is moved to
the root r, and the process continues with the decision block 130 (reached
through connector 220).

When all the bits from Y are processed (branch "No" of the decision
block 130), the process continues with the Input/Output block 270
(reached through connector 200). The source sequence X is stored, and
the process ends in block 280.

Process RV_A_H_D_{m,2}
Input: The source alphabet, S. The encoded sequence, Y. A user-defined
seed, β. It is assumed that we are converging to the value of
f* = 0.5.

Output: The source sequence, X.

Assumption: It is assumed that there is a hashing function which
locates the position of the input alphabet symbols as leaves in T. It
is also assumed that the Process has at its disposal an algorithm (see
Knuth [5] and Vitter [16]) to update the Huffman tree adaptively as the
source symbols come. In order to recover the original source sequence,
X, β must be the same as the one used in Process RV_A_H_E_{m,2}.

Method:
    Construct a Huffman tree, T, assuming any suitable distribution
    for the symbols of S. In this instantiation, it is assumed that the
    symbols are initially equally likely.
    Initialize a pseudo-random number generator, RNG, using β.
    c_0(0) ← 0; n ← 1; f̂(1) ← 0
    q ← root(T); k ← 1
    for n ← 1 to R do    // For all the symbols of the output sequence
        if RNG[next_random_number] > f̂(n) then    // Assignment 0-1
            if y[n] = 0 then
                c_0(n) ← c_0(n - 1) + 1; q ← left(q)
            else
                q ← right(q)
            endif
        else    // Assignment 1-0
            if y[n] = 0 then
                c_0(n) ← c_0(n - 1) + 1; q ← right(q)
            else
                q ← left(q)
            endif
        endif
        if q is a "leaf" then
            x[k] ← symbol(q);
            Update T by using the adaptive Huffman approach [5], [16].
            q ← root(T); k ← k + 1
        endif
        f̂(n) ← c_0(n)/n    // Recalculate the probability of 0 in the output
    endfor
End Process RV_A_H_D_{m,2}
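
The encoding and decoding processes can be illustrated with a minimal Python sketch. It is not the formal procedure: the Huffman tree is simply rebuilt from the symbol counts after every symbol (a stand-in for the incremental Knuth/Vitter update), the heavier child is always placed on the left, Python's `random.Random` plays the role of the RNG, and the names `Node`, `build_tree`, `find_path`, `rv_encode` and `rv_decode` are assumptions introduced here. The alphabet is assumed to contain at least two symbols.

```python
import heapq
import itertools
import random


class Node:
    """Huffman tree node; leaves carry a symbol, internal nodes two children."""
    def __init__(self, weight, symbol=None, left=None, right=None):
        self.weight, self.symbol = weight, symbol
        self.left, self.right = left, right


def build_tree(counts):
    """Rebuild a Huffman tree from counts, heavier child on the left.

    Assumption of this sketch: replaces the incremental adaptive update.
    """
    tie = itertools.count()  # tie-breaker so heapq never compares Node objects
    heap = [(w, next(tie), Node(w, s)) for s, w in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, light = heapq.heappop(heap)
        w2, _, heavy = heapq.heappop(heap)  # w2 >= w1
        merged = Node(w1 + w2, left=heavy, right=light)
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


def find_path(node, symbol):
    """Root-to-leaf path of symbol as a list of booleans (True = left)."""
    if node.left is None:
        return [] if node.symbol == symbol else None
    for go_left, child in ((True, node.left), (False, node.right)):
        tail = find_path(child, symbol)
        if tail is not None:
            return [go_left] + tail
    return None


def rv_encode(source, alphabet, seed):
    rng = random.Random(seed)
    counts = {s: 1 for s in alphabet}      # symbols initially equally likely
    tree = build_tree(counts)
    bits, zeros, f_hat = [], 0, 0.0
    for x in source:
        for go_left in find_path(tree, x):
            if rng.random() > f_hat:       # assignment 0-1: left -> 0, right -> 1
                bit = 0 if go_left else 1
            else:                          # assignment 1-0: left -> 1, right -> 0
                bit = 1 if go_left else 0
            bits.append(bit)
            zeros += bit == 0
            f_hat = zeros / len(bits)      # estimated probability of 0 so far
        counts[x] += 1
        tree = build_tree(counts)
    return bits


def rv_decode(bits, alphabet, seed):
    rng = random.Random(seed)
    counts = {s: 1 for s in alphabet}
    tree = build_tree(counts)
    q, symbols, zeros, f_hat = tree, [], 0, 0.0
    for n, bit in enumerate(bits, start=1):
        if rng.random() > f_hat:           # assignment 0-1
            q = q.left if bit == 0 else q.right
        else:                              # assignment 1-0
            q = q.left if bit == 1 else q.right
        zeros += bit == 0
        f_hat = zeros / n
        if q.left is None:                 # leaf reached: emit symbol, update tree
            symbols.append(q.symbol)
            counts[q.symbol] += 1
            tree = build_tree(counts)
            q = tree
    return symbols
```

With the same alphabet and seed on both sides, `rv_decode(rv_encode(x, S, seed), S, seed)` returns `x`, because both functions draw one random number per bit and maintain the same estimate and the same tree.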

The Proof of Convergence 
Prior to considering the more general case of RV_A_H_E_{m,2}, the
convergence of the binary-input case, RV_A_H_E_{2,2}, is first discussed. This
particular case uses an adaptively constructed Huffman tree which has
three nodes: the root and two children. This tree has the sibling property,
and hence the decision on the branch assignment is based on the
value of f̂(n) and the pseudo-random numbers generated. Consider the
case in which the requested probability in the output is f* = 0.5. The
proof of convergence³ for this particular case is stated in Theorem 7, and
proven in Appendix B.⁴

Theorem 6 (Convergence of Process RV_A_H_E_{2,2}). Consider a
memoryless source whose alphabet is S = {0, 1} and a code alphabet,
A = {0, 1}. If the source sequence X = x[1], . . . , x[n], . . ., with x[i] ∈ S,
i = 1, . . . , n, . . ., is encoded using the Process RV_A_H_E_{2,2} so as to
yield the output sequence Y = y[1], . . . , y[n], . . ., such that y[i] ∈ A,
i = 1, . . . , n, . . ., then

    lim_{n→∞} E[f̂(n)] = f*,  and        (8)
    lim_{n→∞} Var[f̂(n)] = 0,            (9)

where f* = 0.5 is the requested probability of 0 in the output, and
f̂(n) = c_0(n)/n, with c_0(n) being the number of 0's encoded up to time n.
Thus f̂(n) converges to f* in the mean square sense, and in probability. □

³ The modus operandi of the proof is slightly different from that of Theorem 4. It
rather follows the proofs of the L_RP scheme of automata learning (see Lakshmivarahan
(Learning Algorithms Theory and Applications, Springer-Verlag, New York, (1981)),
and Narendra et al (Learning Automata: An Introduction, Prentice Hall, (1989))).
The quantity E[f̂(n + 1) | f̂(n)] in terms of f̂(n) is first computed. The expectation
is then taken a second time and E[f̂(n + 1)] is solved by analyzing the difference
equation. It turns out that this difference equation is linear!

⁴ The price that is paid for randomizing the solution to DODC is that the complexity
of the analysis increases; f̂(n) is now a random variable whose mean converges to f*
and variance converges to zero. Mean square convergence, and consequently, convergence
in probability, are thus guaranteed.

When RV_A_H_E_{2,2} is generalized to the Huffman tree maintained in
RV_A_H_E_{m,2}, the latter satisfies the sibling property even though its
structure may change at every time 'n'. As before, the proof of convergence
of RV_A_H_E_{m,2} relies on the convergence properties of the binary
input alphabet scheme, RV_A_H_E_{2,2}. The convergence result is stated
and proved in Theorem 8. Crucial to the argument is the fact that
the estimated probability of 0 in the output can vary within the range
[1 - f_max, f_max].

Theorem 7 (Convergence of Process RV_A_H_E_{m,2}). Consider a
memoryless source whose alphabet is S = {s_1, . . . , s_m} and a code
alphabet, A = {0, 1}. If the source sequence X = x[1], . . . , x[M], . . ., with
x[i] ∈ S, i = 1, . . . , n, . . ., is encoded using the Process RV_A_H_E_{m,2} so
as to yield the output sequence Y = y[1], . . . , y[R], . . ., such that y[i] ∈ A,
i = 1, . . . , R, . . ., then

    lim_{n→∞} E[f̂(n)] = f*,  and        (10)
    lim_{n→∞} Var[f̂(n)] = 0,            (11)

where f* = 0.5 is the requested probability of 0 in the output, and
f̂(n) = c_0(n)/n, with c_0(n) being the number of 0's encoded up to time n.
Thus f̂(n) converges to f* in the mean square sense and in probability. □

Corollary 4. The Process RV_A_H_E_{m,2} guarantees Statistical Perfect
Secrecy. □

RF_A_H_E_{m,2}: A Randomized Embodiment not Utilizing f̂(n)

The above randomized embodiment, RV_A_H_E_{m,2}, was developed by
comparing the invoked random number with f̂(n). Since the mean of
the random number is 0.5, and since convergence to a value f* = 0.5 is
intended, it is easy to modify RV_A_H_E_{m,2} so that the random number
invoked is compared to the fixed value 0.5, as opposed to the time-varying
value f̂(n).

Indeed, this is achieved by modifying Process RV_A_H_E_{m,2} by merely
changing the comparison:

    if RNG[next_random_number] > f̂(n) then    // Assignment 0-1

to:

    if RNG[next_random_number] > 0.5 then    // Assignment 0-1.

Since the branch assignment rule does not depend on f̂(n), the
convergence rate degrades. However, this modified process,
RF_A_H_E_{m,2}, is computationally more efficient than RV_A_H_E_{m,2}, since
it avoids the necessity to constantly update f̂(n).

The convergence and Statistical Perfect Secrecy properties of
RF_A_H_E_{m,2} are easily proven.

The formal description of the processes RF_A_H_E_{m,2} and
RF_A_H_D_{m,2}, and the corresponding proofs of their properties, are
omitted to avoid repetition.

RR_A_H_E_{m,2}: An Embodiment Utilizing an f̂(n)-based Random Variable

As observed, the above randomized embodiment, RV_A_H_E_{m,2}, was
developed by comparing the invoked random number, α, with f̂(n),
which comparison was avoided in RF_A_H_E_{m,2}.

A new process, RR_A_H_E_{m,2}, is now designed in which the random
number invoked, α, is not compared to f̂(n) but to a second random
variable whose domain depends on f̂(n).

Observe that the mean of the random number in [0,1] is 0.5. Since
convergence to a value f* = 0.5 is intended, RR_A_H_E_{m,2} is developed
by forcing f̂(n) to move towards 0.5 based on how far it is from the fixed
point. But rather than achieve this in a deterministic manner, this is
done by invoking two random variables. The first, α, as in RV_A_H_E_{m,2},
returns a value in the interval [0,1]. The second, α₂, is a random value
in the interval [f̂(n), 0.5] (or [0.5, f̂(n)], depending on whether f̂(n) is
greater than 0.5 or not). The branch assignment is made to be either
0-1 or 1-0 randomly depending on where α is with regard to α₂.
The rationale for this is exactly as in the case of RV_A_H_E_{m,2}, except
that the interval [f̂(n), 0.5] (or [0.5, f̂(n)]) is amplified to ensure that the
convergence is hastened, and the variations around the fixed point are
minimized. Figure 19 clarifies this.

These changes are achieved by modifying Process RV_A_H_E_{m,2} by
merely changing the comparisons:

    if RNG[next_random_number] > f̂(n)
    then    // Assignment 0-1

to:

    if RNG[next_random_number] > RNG[next_random_number (f̂(n), 0.5)]
    then    // Assignment 0-1.

It should be observed that this process, RR_A_H_E_{m,2}, is
computationally less efficient than RV_A_H_E_{m,2} because it requires two random
number invocations. However, the transient behavior is better, and the
variation around the fixed point is less than that of RV_A_H_E_{m,2}.
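
The three comparison rules (RV, RF and RR) differ only in the threshold against which the first random number is compared, which the following Python sketch makes concrete. The function name `choose_assignment` and the variant labels are assumptions of this sketch, and for RR the second random number is drawn uniformly from the interval between f̂(n) and 0.5 as written in the comparison above; the amplification of that interval shown in Figure 19 is not reproduced, since its details are not given in the text.

```python
import random


def choose_assignment(variant, rng, f_hat):
    """Return "0-1" or "1-0" for the RV, RF or RR comparison rule."""
    alpha = rng.random()                     # first random number, in [0, 1)
    if variant == "RV":
        threshold = f_hat                    # compare with the running estimate
    elif variant == "RF":
        threshold = 0.5                      # compare with the fixed point
    elif variant == "RR":
        lo, hi = min(f_hat, 0.5), max(f_hat, 0.5)
        threshold = rng.uniform(lo, hi)      # second random number, alpha_2
    else:
        raise ValueError(f"unknown variant: {variant}")
    return "0-1" if alpha > threshold else "1-0"


# Example 5 uses alpha = 0.5972... and an estimate of 0.3; any rng object
# providing random() and uniform() can be passed in, e.g. random.Random(seed).
rng = random.Random(12345)
assignment = choose_assignment("RV", rng, 0.3)
```

RF needs no estimate at all, RV needs one random draw and the estimate, and RR needs two draws per bit, which mirrors the efficiency ordering described above.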

The convergence and Statistical Perfect Secrecy properties of
RR_A_H_E_{m,2} are easily proven.

The formal description of the processes RR_A_H_E_{m,2} and
RR_A_H_D_{m,2}, and the corresponding proofs of their properties, are
omitted to avoid repetition. However, the experimental results involving these
processes will be included later.

Empirical Results 

As before, the Processes RV_A_H_E_{m,2}, RF_A_H_E_{m,2} and
RR_A_H_E_{m,2}, and their respective decoding counterparts, have been
rigorously tested on files of the Calgary corpus and the Canterbury corpus.
The empirical results obtained for these runs for RV_A_H_E_{m,2} are shown
in Tables 8 and 9 respectively.

File Name    f̂_TAH          f̂_RVAHE        Distance
book1        (illegible)    (illegible)    3.11515E-08
book2        (illegible)    (illegible)    (illegible)
geo          (illegible)    (illegible)    (illegible)
news         0.534758020    0.500035467    1.25789E-09
obj1         0.527805217    0.498309924    2.85636E-06
obj2         0.527944809    0.499897080    1.05926E-08
paper1       0.535224321    0.498786652    1.47221E-06
progc        0.535104455    0.499165603    6.96219E-07
progl        0.535598140    0.500651864    4.24927E-07
progp        0.535403385    0.499089276    8.29419E-07
trans        0.533495837    0.499352913    4.18722E-07
Average                                    1.77154E-07

Table 8: Empirical results obtained from RV_A_H_E_{m,2} and TAH, which
were tested on files of the Calgary corpus, where f* = 0.5.

The second column, f̂_TAH, corresponds to the estimated probability of
0 in the output obtained from the Traditional Adaptive Huffman scheme
(TAH). The third column, f̂_RVAHE, contains the estimated probability
of 0 in the output obtained from running RV_A_H_E_{m,2}. The last column
corresponds to the distance between f* = 0.5 and f̂_RVAHE, calculated as
in (14).
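
Equation (14) is defined elsewhere in this document. As a quick check, the tabulated distances are numerically consistent with the squared difference between f* and the estimated probability; the Python sketch below verifies this on two rows of Table 8. The function name `distance` is introduced here for illustration only.

```python
import math


def distance(f_hat, f_star=0.5):
    """Squared difference between the requested and the estimated probability."""
    return (f_star - f_hat) ** 2


# (file, estimated probability from RV_A_H_E, tabulated distance), from Table 8
rows = [
    ("news", 0.500035467, 1.25789e-09),
    ("obj1", 0.498309924, 2.85636e-06),
]
for name, f_hat, tabulated in rows:
    # Agreement up to the rounding of the printed digits.
    assert math.isclose(distance(f_hat), tabulated, rel_tol=1e-4), name
```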

Observe the high accuracy of RV_A_H_E_{m,2} for all the files. Although
the performance of RV_A_H_E_{m,2} is comparable to that of D_A_H_E_{m,2},
the average distance is less than 2.0E-07 for files of the Calgary corpus
and the Canterbury corpus. Note also that the largest value of the
distance for the files of the Calgary corpus and the Canterbury corpus is
less than 1.5E-06.

File Name       f̂_TAH          f̂_RVAHE        Distance
(illegible)     0.534540728    0.498457539    2.37919E-06
grammar.lsp     0.537399558    0.500566251    3.20641E-07
kennedy.xls     0.535477484    0.500015798    2.49567E-10
lcet10.txt      0.538592652    0.500311976    9.73291E-08
plrabn12.txt    0.542479709    0.500325952    1.06245E-07
ptt5            0.736702657    0.500339382    1.15180E-07
xargs.1         0.536812041    0.500181340    3.28842E-08
Average                                       1.17159E-07

Table 9: Empirical results obtained from running RV_A_H_E_{m,2} and TAH
on the files of the Canterbury corpus, where f* = 0.5.

A graphical analysis of RV_A_H_E_{m,2} is presented by plotting the
value of the distance obtained by processing the file bib of the Calgary corpus.
The plot of the distance for this is depicted in Figure 20.

Note how the estimated probability of 0 converges very rapidly (although
marginally slower than D_A_H_E_{m,2}) to 0.5. This is reflected in
the fact that the distance remains arbitrarily close to 0 as n increases.
Similar results are available for the other files of the Calgary corpus and
for files of the Canterbury corpus, and are omitted.

Encryption Processes Utilizing DDODE and 
RDODE Solutions 

This section describes how the deterministic and randomized embodiments
of DODE can be incorporated to yield various encryption strategies.

As mentioned earlier, the notation used is that DDODE is a generic
name for any Deterministic encoding solution that yields Statistical Perfect
Secrecy. Similarly, RDODE is a generic name for any Randomized
encoding solution that yields Statistical Perfect Secrecy.

Elimination of Transient Behavior 

There are two ways to eliminate the transient behavior of any DDODE
or RDODE process. These are explained below.

In the first method, the Sender and the Receiver both initialize n
to have a large value (typically about 1000) and also assign f̂(n) to be
0.5. Since the initial solution for f̂(n) is at the terminal fixed point, the
process is constrained by the updating rule to be arbitrarily close to it.
This eliminates the transient behavior.
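
The effect of the first method on the updating rule can be seen in a small Python sketch. The class name `ZeroFrequencyEstimator` and its parameters are assumptions introduced here; the sketch only models the running estimate, not the encoder itself.

```python
class ZeroFrequencyEstimator:
    """Running estimate of the probability of 0, pre-loaded at the fixed point."""

    def __init__(self, initial_n=1000, initial_f=0.5):
        self.n = initial_n                   # behave as if initial_n bits were seen
        self.zeros = initial_f * initial_n   # of which this many were zeros

    @property
    def f_hat(self):
        return self.zeros / self.n

    def update(self, bit):
        """Account for one more output bit and return the new estimate."""
        self.n += 1
        self.zeros += 1 if bit == 0 else 0
        return self.f_hat


est = ZeroFrequencyEstimator()
after_one = est.update(1)   # a single bit moves the estimate by (bit - f_hat)/(n + 1)
```

Because each update changes the estimate by at most 1/(n + 1), starting with n near 1000 keeps the estimate within about one thousandth of its previous value at every step.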

In the second method, the input sequence X is padded in a prefix
manner with a simple key-related string. This string could be, in the
simplest case, a constant number of repetitions of the key itself. In more
elegant implementations, this prefix string could consist of multiple
predetermined variations of a string generated from the key. It should
be noted that since the Sender and the Receiver are aware of the key,
they are also aware of the prefix padded portion. This key-related prefix
is called the TransientPrefix. The only requirements imposed on
TransientPrefix are that it must be key related, known a priori to the
Sender and Receiver, and of sufficient length so that after it
is processed, the system enters into its steady state, or non-transient
behavior. Typically, a string of a few thousand bits guarantees this
phenomenon.

The issue of eliminating the transient is solved as follows. The Sender
first pads X with TransientPrefix, to generate a new message X_Temp,
which is now processed by the encryption that utilizes either the process
DDODE or RDODE. Before transmitting its coded form, say, Y'_Temp,
the Sender deletes the prefix information in Y'_Temp
which pertains to the encryption of TransientPrefix, to yield the resultant
string Y_Temp. Y_Temp does not have any transient characteristics - it
converges to 0.5 immediately. The Receiver, being aware of this padding
information and the process which yielded Y_Temp, pads Y_Temp with
the prefix information pertaining to TransientPrefix to yield Y'_Temp,
whence the decryption follows.
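
The padding protocol can be sketched in Python as follows. The coder used here, `toy_encode`, is a hypothetical stand-in (a seeded XOR bit stream) and not a DDODE or RDODE process; it is used only because, like those processes, it codes its input sequentially, so that the coded form of TransientPrefix is a prefix of the coded form of the padded message. All function names are assumptions of this sketch.

```python
import random


def toy_encode(bits, seed):
    """Hypothetical sequential coder: XOR each bit with a seeded bit stream."""
    rng = random.Random(seed)
    return [b ^ rng.getrandbits(1) for b in bits]


def toy_decode(bits, seed):
    """Inverse of toy_encode (XOR with the same stream)."""
    return toy_encode(bits, seed)


def send(x, transient_prefix, seed):
    """Code the padded message, then delete the part that codes the prefix."""
    coded_prefix = toy_encode(transient_prefix, seed)
    coded = toy_encode(transient_prefix + x, seed)
    return coded[len(coded_prefix):]


def receive(y, transient_prefix, seed):
    """Re-derive the deleted part, decode, and strip TransientPrefix."""
    coded_prefix = toy_encode(transient_prefix, seed)
    padded = toy_decode(coded_prefix + y, seed)
    return padded[len(transient_prefix):]


message = [1, 0, 1, 1, 0, 0, 1]
prefix = [0, 1] * 50                 # stands in for a key-related string
recovered = receive(send(message, prefix, seed=7), prefix, seed=7)
```

Since both parties know the key, and hence the prefix and the seed, the Receiver can regenerate exactly the portion that the Sender removed.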

The formal procedure for achieving this is straightforward and omitted
in the interest of brevity. Henceforth, whenever reference is made to
encryptions involving DDODE or RDODE, it refers to the versions in
which the transient behavior is eliminated.

Serial use of DDODE or RDODE with Traditional Encryptions

Both DDODE and RDODE can be used to enhance encryption systems
by utilizing them serially, in conjunction with any already existing
encryption mechanism.

Two encryption methods, DODE⁻ and DODE⁺ respectively, based on
a solution to the DODE are presented. The solution to the DODE can
be either deterministic (DDODE) or randomized (RDODE).

1. DODE⁻ uses a DDODE or RDODE process as an encoding mechanism
   in conjunction with any encryption, say ENC_NonPreserve, that
   does not preserve the input-output random properties. Observe that,
   without a key specifying mechanism, both DDODE and RDODE
   can be used as encoding processes that guarantee Statistical Perfect
   Secrecy. If this output is subsequently encrypted, the system cannot
   be broken by statistical methods because DDODE and RDODE
   annihilate the statistical information found in the original plaintext,
   and simultaneously provide Stealth. However, the output of the
   serial pair will depend on the stochastic properties of ENC_NonPreserve.
   ENC_NonPreserve, the encryption operating in tandem with DDODE
   or RDODE, can be any public key cryptosystem or private key
   cryptosystem (see Stinson (Cryptography: Theory and Practice, CRC
   Press, (1995))).

2. DODE⁺ uses a DDODE or RDODE process as an encoding mechanism
   in conjunction with any encryption that preserves the input-output
   random properties, say, ENC_Preserve. Typically, any "good"
   substitution or permutation encryption achieves this (one such
   straightforward process for ENC_Preserve is given in Appendix C). From the
   earlier discussions it is clear that without a key specifying mechanism,
   both DDODE and RDODE can be used as encoding processes
   that guarantee Statistical Perfect Secrecy. If this output is subsequently
   encrypted by ENC_Preserve, the system cannot be broken by
   statistical methods because DDODE and RDODE annihilate the
   statistical information found in the original plaintext. Observe that
   since ENC_Preserve preserves the input-output random properties,
   one can expect the output of the tandem pair to also guarantee
   Statistical Perfect Secrecy. Thus, breaking DODE⁺ using statistical-based
   cryptanalytic methods is impossible, and necessarily requires
   the exhaustive search of the entire key space. DODE⁺ also provides
   Stealth.

RDODE* : Encryption Utilizing RDODE 

This section describes how RDODE, any randomized embodiment for
DODE, can be utilized for encryption. The encryption is obtained by
incorporating into RDODE a key specifying mechanism.

Unlike DODE⁻ and DODE⁺, in which the two processes of compression
(expansion or data length specification in the case of general
Oommen-Rueda Trees) and encryption operate serially on each other, in
the case of RDODE*, it is not possible to decompose RDODE* into the
two composite mutually exclusive processes. They augment each other
in a non-decomposable manner.

RDODE* uses a private key. This key⁵ is used to specify β, the initial
seed used by the Random Number Generator invoked by the specific
randomized solution, RDODE. The mapping from the key to the seed is
now described.

Formally, the key, K, is a sequence of symbols, k[1] . . . k[T], where
∀i, k[i] ∈ B = {b_1, . . . , b_t}. The key uniquely specifies the initial seed
used by the pseudo-random generator, by means of a transformation of K
into an integer. There are numerous ways by which this transformation
can be defined. First of all, any key, K, of length T can clearly be
transformed into an integer in the range [0, . . . , t^T - 1]. Furthermore, since
the symbols of the key can be permuted, there are numerous possible
transformations by which such a key-to-integer mapping can be specified. One
such transformation procedure is proposed below.

Consider a key, K = k[1] . . . k[T] (where k[i] ∈ B), which is to be
transformed into a seed, β. First, K is transformed into a sequence of
integers, where each symbol of B is mapped to a unique integer between
0 and t - 1. This sequence is of the form K' = k'[1] . . . k'[T], such that
∀i, k'[i] ∈ {0, . . . , t - 1}. After this transformation, the seed⁶, β, is
calculated as follows:

    β = Σ_{i=0}^{T-1} k'[i + 1] t^i.        (12)

It can be trivially seen that for each transformed key, k'[1] . . . k'[T],
there is a unique integer, β, in the set {0, . . . , t^T - 1}, and hence if
β ≤ Z_max, Statistical Perfect Secrecy can be achieved, where Z_max is the
maximum possible seed of the pseudo-random number generator. Observe
too that both the Sender and Receiver must have the same initial
seed, β, so as to recover the original source sequence.

⁵ The key alphabet can contain letters, numbers, special symbols (such as '*', ')',
'#'), the blank space, etc. In a developed prototype, the key alphabet consists of
{a, . . . , z} ∪ {A, . . . , Z} ∪ {0, . . . , 9} ∪ {space, '.'}.

⁶ It should be noted that this conversion yields the seed as an integer number; if
the system requires a binary seed, the latter can be trivially obtained from the binary
expansion of β.

After invoking the RNG using β, the value of the seed, β, is updated.
This implies that the labeling branch assignment to yield the n-th bit of the
output is determined by the current pseudo-random number generated
using the current value of β. Thus, the same sequences of pseudo-random
numbers are utilized by the encryption and decryption mechanisms.

In any practical implementation, the maximum number that a seed
can take depends on the precision of the system being used. When β >
Z_max, where Z_max is the maximum seed that the system supports, β
can be decomposed into J seeds, Z_1, . . . , Z_J, so as to yield J pseudo-random
numbers. The actual pseudo-random number (say α) to be
utilized in the instantiation of RDODE is obtained by concatenating the
J random numbers generated at time n, each of which invokes a specific
instantiation of the system's RNG. Thus, if the encryption guarantees
192-bit security, it would imply the invocation of four 48-bit pseudo-random
numbers, each being a specific instantiation of the system's RNG.
A small example will help to clarify this procedure.

Example 6. Suppose that a message is to be encrypted using the key
K = "The Moon", which is composed of symbols drawn from the alphabet
B that contains 64 symbols: decimal digits, the blank space, upper
case letters, lower case letters, and '.'. The transformation of K into K'
is done as follows:

• the digits from 0 to 9 are mapped to the integers from 0 to 9,

• the blank space is mapped to 10,

• the letters from A to Z are mapped to the integers from 11 to 36,

• the letters from a to z are mapped to the integers from 37 to 62,
and

• the period, '.', is mapped to 63.

Therefore, K' = 30 44 41 10 23 51 51 50. There are 64^8 = 281,474,976,710,656
possible seeds, which are contained in the set {0, . . . , 64^8 - 1}.
The seed, β, is calculated from K' as follows:

β = (30)64^0 + (44)64^1 + (41)64^2 + (10)64^3 + (23)64^4 + (51)64^5 + (51)64^6
    + (50)64^7
  = 223,462,168,369,950.

Observe that such a key-to-seed transformation is ideal for an RNG
implementation working with 48-bit accuracy, as in the Java programming
language (jdk 1.xx). In such a case, a key of length eight chosen
from the 64-symbol key space leads to a seed which is uniform in the
seed space. □
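
The key-to-seed transformation of equation (12), with the symbol mapping of Example 6, can be written in a few lines of Python. The names `KEY_ALPHABET` and `key_to_seed` are introduced here for illustration; the sketch covers only the integer conversion, not the seeding of any particular RNG.

```python
import string

# Key alphabet of Example 6, in the order of its mapping:
# digits 0-9, blank space, upper case letters, lower case letters, period.
KEY_ALPHABET = (
    string.digits + " " + string.ascii_uppercase + string.ascii_lowercase + "."
)


def key_to_seed(key, alphabet=KEY_ALPHABET):
    """Map a key to the integer seed: the sum of k'[i+1] * t**i for i = 0..T-1."""
    t = len(alphabet)
    index = {symbol: value for value, symbol in enumerate(alphabet)}
    return sum(index[symbol] * t ** i for i, symbol in enumerate(key))


print(key_to_seed("The Moon"))  # 223462168369950, the value computed in Example 6
```

The first key symbol receives the lowest power of t, so permuting the key symbols, or the order of the alphabet, yields a different but equally valid key-to-integer mapping.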

We conclude this section by noting that by virtue of the previous
results, every instantiation of RDODE* guarantees Statistical Perfect
Secrecy. Thus, breaking RDODE* using statistical-based cryptanalytic
methods is impossible, and necessarily requires the exhaustive search of
the entire key space.






Testing of Encryptions Utilizing RV_A_H_E_{m,2} and RR_A_H_E_{m,2}

To demonstrate the power of the encryption in which the transient
behaviour is eliminated, and the seed is key-dependent, RV_A_H_E_{m,2}
and RR_A_H_E_{m,2} have been incorporated into a fully operational
48-bit prototype cryptosystem. Both of these have been rigorously
subjected to a variety of tests, and the results are cataloged below.
The key-to-seed specifying mechanism is exactly as explained in the
above example.

The results for testing RV_A_H_E_{m,2} on the Calgary Corpus and the
Canterbury Corpus are given in Tables 10 and 11, respectively. Observe
that the value of the distance as computed by (14) is arbitrarily close
to zero in every case. Similar results for RR_A_H_E_{m,2} on the Calgary
Corpus and the Canterbury Corpus are given in Tables 12 and 13,
respectively.

The variation of the distance as a function of the encrypted stream
has also been plotted for various files. Figure 21 graphically plots the
average distance obtained after encrypting file bib (from the Calgary
Corpus) using the Process RV_A_H_E_{m,2} as the kernel. The similar figure
in which Process RR_A_H_E_{m,2} is used as the kernel is found in Figure
22. Observe that the graphs attain their terminal value close to zero
right from the outset, without any transient characteristics.

File Name    f (TAH)        f (RV_A_H_E_{m,2})    d(F, F*)
bib          0.534890885    0.498926052           1.15336E-06
book1        0.539382961    0.500425219           1.80811E-07
book2        0.536287453    0.500349415           1.22091E-07
geo          0.526089998    0.501331183           1.77205E-06
news         0.534758020    0.499712212           8.28219E-08
obj1         0.527805217    0.4997762             5.00864E-08
obj2         0.527944809    0.499521420           2.29039E-07
paper1       0.535224321    0.500632728           4.00345E-07
progc        0.535104455    0.499916321           7.00218E-09
progl        0.535598140    0.499278758           5.20190E-07
progp        0.535403385    0.499946669           2.84420E-09
trans        0.533495837    0.499784304           4.65248E-08
Average                                           2.74079E-07

Table 10: Empirical results obtained after encrypting files from the Calgary
Corpus using the Process RV_A_H_E_{m,2} as the kernel. The results are also
compared with the Traditional Adaptive Huffman (TAH) scheme.






File Name       f (TAH)        f (RV_A_H_E_{m,2})    d(F, F*)
alice29.txt     0.543888521    0.499315030           4.69184E-07
asyoulik.txt    0.538374097    0.500483773           2.34036E-07
cp.html         0.537992949    0.498694099           1.70538E-06
fields.c        0.534540728    0.501490468           2.22149E-06
grammar.lsp     0.537399558    0.505635555           2.22149E-06
kennedy.xls     0.535477484    0.500505392           3.17595E-05
lcet10.txt      0.538592652    0.499913752           2.55421E-07
plrabn12.txt    0.542479709    0.499932933           7.43872E-09
ptt5            0.736702657    0.500238738           5.69958E-08
xargs.1         0.536812041    0.505213528           2.71809E-05
Average                                              1.16081E-05

Table 11: Empirical results obtained after encrypting files from the
Canterbury Corpus using the Process RV_A_H_E_{m,2} as the kernel. The
results are also compared with the Traditional Adaptive Huffman (TAH)
scheme.

Output Markovian Modeling and Independence Analysis 

The theoretical analysis of convergence and the empirical results
discussed for RV_A_H_E_{m,2} were done by merely considering the
probabilities of occurrence of the output symbols. The question of the
dependencies of the output symbols is now studied using a Markovian
analysis and a $\chi^2$ hypothesis testing analysis. To achieve this goal,
the possibility of the output sequence obtained by RV_A_H_E_{m,2} being a
higher-order Markov model is analyzed. This model, which considers
conditional probabilities of a symbol given the occurrence of the k
previous symbols, is called a k-th order model.

Consider the first-order model with the following conditional
probabilities: $f_{0|0}(n)$ and $f_{1|0}(n)$ (where the latter is equal to
$1 - f_{0|0}(n)$), and $f_{0|1}(n)$ and $f_{1|1}(n)$ (where the latter, in
turn, is $1 - f_{0|1}(n)$). These probabilities can also be expressed in
a 2 x 2 matrix form, the underlying transition matrix. Since the
asymptotic probabilities are [0.5, 0.5], it is easy to see that this
matrix has to be symmetric, of the form

$$\begin{bmatrix} \delta & 1-\delta \\ 1-\delta & \delta \end{bmatrix},$$

where $\delta = 0.5$ is the case when there is no Markovian dependence.
The value of $\delta$ is estimated from the output sequence in a
straightforward manner.

File Name    f (TAH)        f (RR_A_H_E_{m,2})    d(F, F*)
bib          0.534890885    0.498730944           1.61050E-06
book1        0.539382961    0.499440591           3.12938E-07
book2        0.536287453    0.499593621           1.65144E-07
geo          0.526089998    0.500380707           1.44938E-07
news         0.534758020    0.499766932           5.43207E-08
obj1         0.527805217    0.500416731           1.73665E-07
obj2         0.527944809    0.499376045           3.89320E-07
paper1       0.535224321    0.498987636           1.02488E-06
progc        0.535104455    0.500757891           5.74399E-07
progl        0.535598140    0.499958084           1.75695E-09
progp        0.535403385    0.500176402           3.11177E-08
trans        0.533495837    0.498818445           1.39607E-06
Average                                           3.5628E-07

Table 12: Empirical results obtained after encrypting files from the Calgary
Corpus using the Process RR_A_H_E_{m,2} as the kernel. The results are
also compared with the Traditional Adaptive Huffman (TAH) scheme.

File Name       f (TAH)        f (RR_A_H_E_{m,2})    d(F, F*)
alice29.txt     0.543888521    0.499618479           1.45558E-07
asyoulik.txt    0.538374097    0.499708422           8.50177E-08
cp.html         0.537992949    0.500003807           1.44932E-11
fields.c        0.534540728    0.502928943           8.57871E-06
grammar.lsp     0.537399558    0.499433748           3.20641E-07
kennedy.xls     0.535477484    0.499559689           1.93874E-07
lcet10.txt      0.538592652    0.499937297           3.93167E-09
plrabn12.txt    0.542479709    0.500082543           6.81335E-09
ptt5            0.736702657    0.500392045           1.53699E-07
xargs.1         0.536812041    0.504488168           2.01437E-05
Average                                              1.92746E-07

Table 13: Empirical results obtained after encrypting files from the
Canterbury Corpus using the Process RR_A_H_E_{m,2} as the kernel. The
results are also compared with the Traditional Adaptive Huffman (TAH)
scheme.

In order to analyze the independence of the symbols in the output,
RV_A_H_E_{m,2} has been tested on files of the Calgary corpus and the
Canterbury corpus. The requested probability of 0 in the output is
$f^* = 0.5$. The estimated probability of 0 given 0 in a file of R bits
is obtained as follows: $f_{0|0} = c_{0|0}(R) / c_0(R)$, where
$c_{0|0}(R)$ is the number of occurrences of the sequence '00' in
$\mathcal{Y}$, and $c_0(R)$ is the number of 0's in $\mathcal{Y}$.
Analogously, $f_{1|1}$ is calculated.
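
The first-order estimates just described can be sketched as follows. This is a minimal illustration rather than the prototype's code: the function name `first_order_estimates` is hypothetical, and the denominators count only symbols that have a successor, which differs from $c_0(R)$ by at most one.

```python
def first_order_estimates(bits):
    """Estimate f_{0|0} and f_{1|1} from a sequence of 0/1 integers."""
    c0 = c1 = c00 = c11 = 0
    for prev, cur in zip(bits, bits[1:]):
        if prev == 0:
            c0 += 1            # a 0 that is followed by another symbol
            c00 += cur == 0    # ... and that symbol is also 0
        else:
            c1 += 1
            c11 += cur == 1
    f00 = c00 / c0 if c0 else float("nan")
    f11 = c11 / c1 if c1 else float("nan")
    return f00, f11
```

The average of the two returned values plays the role of $\delta$; a value close to 0.5 indicates no first-order dependence.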

Assuming that $\mathcal{Y}$ is a sequence of random variables, the analysis
of independence in the output is done by using a Chi-square hypothesis test
(see Snedecor et al. (Statistical Methods, Iowa State University Press,
8th edition, (1989))). The empirical results obtained from executing
RV_A_H_E_{m,2} on files of the Calgary corpus and the Canterbury corpus
are shown in Tables 14 and 15, respectively.

File name    (f_{0|0} + f_{1|1})/2    Chi-square     Decision (98%)
bib          [illegible]              [illegible]    Indep.
book1        [illegible]              [illegible]    Indep.
book2        [illegible]              [illegible]    Indep.
geo          [illegible]              [illegible]    Indep.
news         [illegible]              [illegible]    Indep.
obj1         [illegible]              [illegible]    Indep.
obj2         [illegible]              [illegible]    Indep.
paper1       0.4995268                0.00000367     Indep.
progc        0.5009482                0.00001100     Indep.
progl        0.4977808                0.00005315     Indep.
progp        0.4994945                0.00002204     Indep.
trans        0.4999502                0.00000060     Indep.

Table 14: Results of the Chi-square test of independence in the output of
the encryption that uses RV_A_H_E_{m,2} as the kernel. The encryption was
tested on files of the Calgary corpus with $f^* = 0.5$.

The second column represents the average between $f_{0|0}$ and $f_{1|1}$,
which are calculated as explained above. The third column contains the
value of the Chi-square statistic for $f_{a_i|a_j}$, where $a_i$, $a_j$
are either 0 or 1, and the number of degrees of freedom is unity. The
last column stands for the decision of the testing based on a 98%
confidence level. Observe that for all the files of the Calgary corpus
and the Canterbury corpus the output random variables are independent.
This implies that besides converging to the value 0.5 in the mean square
sense, the output of RV_A_H_E_{m,2} is also statistically independent as
per the 98% confidence level.

File name       (f_{0|0} + f_{1|1})/2    Chi-square     Decision (98%)
alice29.txt     [illegible]              0.00000019     Indep.
asyoulik.txt    [illegible]              [illegible]    Indep.
cp.html         [illegible]              0.00001010     Indep.
fields.c        [illegible]              [illegible]    Indep.
grammar.lsp     0.5022818                0.00008135     Indep.
kennedy.xls     0.4998954                0.00000009     Indep.
lcet10.txt      0.5006621                0.00000154     Indep.
plrabn12.txt    0.4999315                0.00000013     Indep.
ptt5            0.4993704                0.00000320     Indep.
xargs.1         0.4915672                0.00057094     Indep.

Table 15: Results of the Chi-square test of independence in the output of
the encryption that uses RV_A_H_E_{m,2} as the kernel. The encryption was
tested on files of the Canterbury corpus with $f^* = 0.5$.

Similar results are also available for higher-order Markovian models
up to level five, in which the output of RV_A_H_E_{m,2} proves to be also
statistically independent as per the 98% confidence level.
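
The text does not spell out how the Chi-square statistic was computed, so the following is only a sketch of one standard possibility: Pearson's test of independence on the 2 x 2 table of consecutive output bits, with one degree of freedom. The names `chi_square_independence` and `CRITICAL_98` are hypothetical, and because this statistic is computed on counts, its scale need not match the values listed in Tables 14 and 15.

```python
CRITICAL_98 = 5.412  # 0.98 quantile of the Chi-square distribution, 1 d.o.f.

def chi_square_independence(bits):
    """Pearson Chi-square statistic for the 2x2 table of consecutive bits.

    Assumes both 0 and 1 occur as a predecessor and as a successor;
    otherwise an expected count is zero and the statistic is undefined.
    """
    table = [[0, 0], [0, 0]]
    for prev, cur in zip(bits, bits[1:]):
        table[prev][cur] += 1
    total = sum(map(sum, table))
    rows = [sum(row) for row in table]
    cols = [table[0][j] + table[1][j] for j in (0, 1)]
    stat = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = rows[i] * cols[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

Under this convention, independence is not rejected at the 98% level when the statistic stays below `CRITICAL_98`.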

The FIPS 140-1 Statistical Tests of Randomness of RDODE* 

The results of the previous two subsections demonstrate the power of
the encryptions for Statistical Perfect Secrecy (convergence to $f^* = 0.5$)
and the Markovian independence. However, it is well known that a good
cryptosystem must endure even the most stringent statistical tests. One
such suite of tests is that recommended by the standard FIPS 140-1 (see
Menezes et al. (Handbook of Applied Cryptography, CRC Press, (1996))
pp. 181-183). These tests include the frequency test, the poker test, the
runs test, and the long runs test. As suggested, these tests have been run
for encrypted strings of length 20000 bits, for which the corresponding
statistics for passing the tests are clearly specified. The empirical
results obtained after encrypting the files of the Calgary Corpus and the
Canterbury Corpus using RV_A_H_E_{m,2} as the kernel are cataloged in
Tables 16, 17, 18, 19, 20, 21, 22, 23, 24, and 25, respectively. Notice
that the encryption passes all the tests. Similar results are available
for the encryption which uses RR_A_H_E_{m,2} as the kernel.

File name    n1 min    < n1 <         n1 max
bib          9,654     [illegible]    10,646
book1        9,654     [illegible]    10,646
book2        9,654     [illegible]    10,646
geo          9,654     [illegible]    10,646
news         9,654     [illegible]    10,646
obj1         9,654     [illegible]    10,646
obj2         9,654     [illegible]    10,646
paper1       9,654     9,958          10,646
progc        9,654     10,006         10,646
progl        9,654     10,072         10,646
progp        9,654     9,973          10,646
trans        9,654     10,046         10,646

Table 16: Results of the Monobit test in the output of the encryption that
uses RV_A_H_E_{m,2} as the kernel. The encryption was tested on files of
the Calgary corpus.

File name       n1 min    < n1 <         n1 max
alice29.txt     9,654     [illegible]    10,646
asyoulik.txt    9,654     [illegible]    10,646
cp.html         9,654     [illegible]    10,646
fields.c        9,654     [illegible]    10,646
kennedy.xls     9,654     [illegible]    10,646
lcet10.txt      9,654     10,131         10,646
plrabn12.txt    9,654     9,926          10,646
ptt5            9,654     10,137         10,646
sum             9,654     10,046         10,646
xargs.1         9,654     9,989          10,646

Table 17: Results of the Monobit test in the output of the encryption that
uses RV_A_H_E_{m,2} as the kernel. The encryption was tested on files of
the Canterbury corpus.

File name    X3 min    < X3 <         X3 max
bib          1.03      [illegible]    57.4
book1        1.03      [illegible]    57.4
book2        1.03      [illegible]    57.4
geo          1.03      [illegible]    57.4
news         1.03      [illegible]    57.4
obj1         1.03      [illegible]    57.4
obj2         1.03      11.00          57.4
paper1       1.03      21.72          57.4
progc        1.03      11.15          57.4
progl        1.03      13.67          57.4
progp        1.03      7.71           57.4
trans        1.03      14.30          57.4

Table 18: Results of the Poker test (when m = 4) in the output of the
encryption that uses RV_A_H_E_{m,2} as the kernel. The encryption was
tested on files of the Calgary corpus.

File name       X3 min    < X3 <    X3 max
alice29.txt     1.03      12.34     57.4
asyoulik.txt    1.03      18.19     57.4
cp.html         1.03      8.85      57.4
fields.c        1.03      17.81     57.4
kennedy.xls     1.03      14.31     57.4
lcet10.txt      1.03      12.16     57.4
plrabn12.txt    1.03      23.69     57.4
ptt5            1.03      11.45     57.4
sum             1.03      10.44     57.4
xargs.1         1.03      16.25     57.4

Table 19: Results of the Poker test (when m = 4) in the output of the
encryption that uses RV_A_H_E_{m,2} as the kernel. The encryption was
tested on files of the Canterbury corpus.

File name    l_run    B_i/G_i min    < G_i          B_i <          B_i/G_i max
bib          1        2,267          2,482          2,445          2,733
             2        1,079          1,241          1,214          1,421
             3        502            612            672            748
             4        223            293            [illegible]    402
             5        90             [illegible]    [illegible]    223
             6        90             146            [illegible]    223
book1        1        2,267          2,540          2,539          2,733
             2        1,079          1,305          1,336          1,421
             3        502            628            574            748
             4        223            264            295            402
             5        90             [illegible]    [illegible]    223
             6        90             143            149            223
book2        1        2,267          2,463          2,532          2,733
             2        1,079          1,266          1,184          1,421
             3        502            627            636            748
             4        223            328            322            402
             5        90             [illegible]    [illegible]    223
             6        90             170            170            223
geo          1        2,267          2,494          2,517          2,733
             2        1,079          1,226          1,251          1,421
             3        502            660            613            748
             4        223            320            289            402
             5        90             [illegible]    [illegible]    223
             6        90             152            [illegible]    223
news         1        2,267          2,532          2,579          2,733
             2        1,079          1,255          1,261          1,421
             3        502            643            610            748
             4        223            319            293            402
             5        90             [illegible]    [illegible]    223
             6        90             176            154            223
obj1         1        2,267          2,534          2,510          2,733
             2        1,079          1,257          1,230          1,421
             3        502            620            637            748
             4        223            313            344            402
             5        90             149            [illegible]    223
             6        90             156            168            223

Table 20: Results of the Runs test in the output of the encryption that
uses RV_A_H_E_{m,2} as the kernel. The encryption was tested on files of
the Calgary corpus.

File name    l_run    B_i/G_i min    < G_i          B_i <          B_i/G_i max
obj2         1        2,267          [illegible]    [illegible]    2,733
             2        1,079          1,277          1,241          1,421
             3        502            [illegible]    [illegible]    748
             4        223            [illegible]    [illegible]    402
             5        90             143            164            223
             6        90             177            [illegible]    223
paper1       1        2,267          [illegible]    [illegible]    2,733
             2        1,079          1,328          [illegible]    1,421
             3        502            [illegible]    [illegible]    748
             4        223            [illegible]    [illegible]    402
             5        90             161            144            223
             6        90             [illegible]    [illegible]    223
progc        1        2,267          [illegible]    [illegible]    2,733
             2        1,079          [illegible]    [illegible]    1,421
             3        502            [illegible]    [illegible]    748
             4        223            [illegible]    [illegible]    402
             5        90             [illegible]    [illegible]    223
             6        90             [illegible]    [illegible]    223
progl        1        2,267          [illegible]    [illegible]    2,733
             2        1,079          1,202          1,259          1,421
             3        502            [illegible]    [illegible]    748
             4        223            [illegible]    [illegible]    402
             5        90             [illegible]    176            223
             6        90             147            149            223
progp        1        2,267          [illegible]    [illegible]    2,733
             2        1,079          1,291          1,288          1,421
             3        502            [illegible]    [illegible]    748
             4        223            [illegible]    [illegible]    402
             5        90             152            154            223
             6        90             147            131            223
trans        1        2,267          2,457          2,462          2,733
             2        1,079          1,223          1,245          1,421
             3        502            642            587            748
             4        223            316            312            402
             5        90             146            187            223
             6        90             177            161            223

Table 21: Results of the Runs test in the output of the encryption that uses
RV_A_H_E_{m,2} as the kernel. The encryption was tested on files of the
Calgary corpus.

File name       l_run    B_i/G_i min    < G_i    B_i <    B_i/G_i max
alice29.txt     1        2,267          2,529    2,502    2,733
                2        1,079          1,238    1,292    1,421
                3        502            632      631      748
                4        223            319      308      402
                5        90             173      155      223
                6        90             158      170      223
asyoulik.txt    1        2,267          2,473    2,467    2,733
                2        1,079          1,243    1,268    1,421
                3        502            617      609      748
                4        223            332      296      402
                5        90             146      151      223
                6        90             149      151      223
cp.html         1        2,267          2,579    2,604    2,733
                2        1,079          1,291    1,266    1,421
                3        502            596      644      748
                4        223            294      274      402
                5        90             155      146      223
                6        90             147      133      223
fields.c        1        2,267          2,542    2,593    2,733
                2        1,079          1,314    1,232    1,421
                3        502            588      621      748
                4        223            307      311      402
                5        90             161      170      223
                6        90             161      162      223
kennedy.xls     1        2,267          2,532    2,559    2,733
                2        1,079          1,259    1,212    1,421
                3        502            626      624      748
                4        223            294      351      402
                5        90             174      144      223
                6        90             177      169      223

Table 22: Results of the Runs test in the output of the encryption that uses
RV_A_H_E_{m,2} as the kernel. The encryption was tested on files of the
Canterbury corpus.

File name       l_run    B_i/G_i min    < G_i          B_i <          B_i/G_i max
lcet10.txt      1        2,267          [illegible]    [illegible]    2,733
                2        1,079          1,196          1,224          1,421
                3        502            [illegible]    [illegible]    748
                4        223            314            [illegible]    402
                5        90             [illegible]    [illegible]    223
                6        90             [illegible]    [illegible]    223
plrabn12.txt    1        2,267          [illegible]    [illegible]    2,733
                2        1,079          [illegible]    [illegible]    1,421
                3        502            [illegible]    [illegible]    748
                4        223            [illegible]    [illegible]    402
                5        90             [illegible]    [illegible]    223
                6        90             [illegible]    [illegible]    223
ptt5            1        2,267          [illegible]    [illegible]    2,733
                2        1,079          [illegible]    [illegible]    1,421
                3        502            [illegible]    [illegible]    748
                4        223            [illegible]    [illegible]    402
                5        90             140            [illegible]    223
                6        90             155            [illegible]    223
sum             1        2,267          2,589          2,522          2,733
                2        1,079          1,257          1,286          1,421
                3        502            [illegible]    [illegible]    748
                4        223            [illegible]    [illegible]    402
                5        90             [illegible]    [illegible]    223
                6        90             159            158            223
xargs.1         1        2,267          2,502          2,508          2,733
                2        1,079          1,241          1,264          1,421
                3        502            654            629            748
                4        223            301            304            402
                5        90             152            149            223
                6        90             159            153            223

Table 23: Results of the Runs test in the output of the encryption that uses
RV_A_H_E_{m,2} as the kernel. The encryption was tested on files of the
Canterbury corpus.

File name    Max. run length    Runs > 34
bib          13                 0
book1        12                 0
book2        14                 0
geo          14                 0
news         16                 0
obj1         14                 0
obj2         13                 0
paper1       15                 0
progc        17                 0
progl        16                 0
progp        17                 0
trans        14                 0

Table 24: Results of the Long Runs test in the output of the encryption
that uses RV_A_H_E_{m,2} as the kernel. The encryption was tested on files
of the Calgary corpus.

File name       Max. run length    Runs > 34
alice29.txt     15                 0
asyoulik.txt    16                 0
cp.html         13                 0
fields.c        17                 0
kennedy.xls     14                 0
lcet10.txt      17                 0
plrabn12.txt    14                 0
ptt5            14                 0
sum             13                 0
xargs.1         15                 0

Table 25: Results of the Long Runs test in the output of the encryption
that uses RV_A_H_E_{m,2} as the kernel. The encryption was tested on files
of the Canterbury corpus.
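
The four tests reported in Tables 16 to 25 can be sketched on a 20,000-bit sample as follows. This is an illustrative implementation, not the prototype's code; the names `RUN_BOUNDS` and `fips_140_1` are hypothetical. The poker and runs intervals are the ones shown in the tables, while the monobit interval is taken from the FIPS 140-1 text as summarized in the Handbook of Applied Cryptography (9,654 < n1 < 10,346), which is tighter than the upper limit of 10,646 printed in Tables 16 and 17.

```python
from itertools import groupby

# Inclusive acceptance intervals for runs of length 1..5 and 6-or-more.
RUN_BOUNDS = {1: (2267, 2733), 2: (1079, 1421), 3: (502, 748),
              4: (223, 402), 5: (90, 223), 6: (90, 223)}

def fips_140_1(bits):
    """Apply the four FIPS 140-1 tests to a list of exactly 20,000 bits (0/1 ints)."""
    if len(bits) != 20000:
        raise ValueError("FIPS 140-1 tests are defined on 20,000-bit samples")

    # Monobit test: number of ones in the sample.
    n1 = sum(bits)
    monobit = 9654 < n1 < 10346

    # Poker test (m = 4): 5,000 non-overlapping 4-bit blocks.
    counts = [0] * 16
    for i in range(0, 20000, 4):
        counts[8 * bits[i] + 4 * bits[i + 1] + 2 * bits[i + 2] + bits[i + 3]] += 1
    x3 = 16 / 5000 * sum(c * c for c in counts) - 5000
    poker = 1.03 < x3 < 57.4

    # Runs and long-run tests: tally gaps (runs of 0) and blocks (runs of 1).
    tallies = {0: [0] * 7, 1: [0] * 7}
    longest = 0
    for bit, group in groupby(bits):
        length = sum(1 for _ in group)
        longest = max(longest, length)
        tallies[bit][min(length, 6)] += 1
    runs = all(lo <= tallies[b][k] <= hi
               for b in (0, 1) for k, (lo, hi) in RUN_BOUNDS.items())
    long_run = longest < 34

    return {"monobit": monobit, "poker": poker, "runs": runs, "long_run": long_run}
```

A sample passes the suite only if all four entries of the returned dictionary are true.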
Statistical Tests of Key-Input-Output Dependency

Apart from the FIPS 140-1 tests described above, the encryption
which uses RV_A_H_E_{m,2} has also been rigorously tested to see how the
encrypted output varies with the keys and the input. In particular, the
probability of observing changes in the output when the key changes by
a single bit has been computed. This is tabulated in Table 26. Observe
that, as desired, the value is close to 0.5 for all the files of the
Calgary Corpus. Similarly, the probability of changing subsequent output
bits after a given input symbol has been changed by one bit has also been
computed. This value is also close to the desired value of 0.5 in almost
all files. The results are tabulated in Table 27. Similar results are
also available for files of the Canterbury Corpus.

File name    p
bib          0.499610
book1        0.499955
book2        0.500052
geo          0.499907
news         0.499842
obj1         0.499463
obj2         0.500309
paper1       0.500962
progc        0.498403
progl        0.500092
progp        0.499961
trans        0.499244

Table 26: Statistical independence test between the key and the output
performed by modifying the key in one single bit on files of the Calgary
corpus. p is the estimated prob. of change.

File name    p
bib          0.492451
book1        0.467442
book2        0.501907
geo          0.500798
news         0.500089
obj1         0.477444
obj2         0.496427
paper1       0.481311
progc        0.484270
progl        0.471639
progp        0.490529
trans        0.491802

Table 27: Statistical independence test between the input and the output
performed by modifying one single bit in the input on files of the Calgary
corpus. p is the estimated prob. of change.
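
The probability of change reported in Tables 26 and 27 can be estimated by comparing two ciphertexts bit by bit. The sketch below shows only that comparison; the helper name `change_probability` is hypothetical, the encryption step itself is not reproduced, and the two outputs are assumed to have been brought to the same length beforehand.

```python
def change_probability(a: bytes, b: bytes) -> float:
    """Fraction of bit positions at which two equal-length outputs differ."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    # XOR marks the differing bits; count the ones in each byte.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return differing / (8 * len(a))
```

For the key test, `a` and `b` would be the outputs obtained with two keys that differ in a single bit; for the input test, they would be the outputs that follow a one-bit change in the plaintext.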






Applications of DDODE and RDODE 

With regard to applications, both DDODE and RDODE (and their
respective instantiations) can be used as compression/encoding schemes
in their own right. They can thus be used in data transmission, data
storage, and the recording of data on magnetic media. However, the major
applications are those that arise from their cryptographic capabilities
and include:

• Traffic on the internet

• E-commerce/e-banking

• E-mail

• Video conferencing

• Secure wired or wireless communications

• Network security. To clarify this, let us assume that the files stored
in a network are saved using one of the schemes advocated by this
invention. In such a case, even if a hacker succeeds in breaking the
firewall and enters the system, he will not be able to "read" the files
he accesses, because they will effectively be stored as random noise.
The advantages of this are invaluable.

Apart from the above, the application of DDODE and RDODE in
steganography is emphasized.

Steganography is the ancient art of hiding information, which dates
back to around 440 B.C. (see Katzenbeisser et al. (Information Hiding
Techniques for Steganography and Digital Watermarking, Artech House,
(2000))). One of the most common steganographic techniques consists
of hiding information in images. To experimentally demonstrate the
power of RV_A_H_E_{m,2}, the latter has been incorporated into a
steganographic application by modifying 256-gray scale images with the
output of RDODE*. By virtue of the Statistical Perfect Secrecy property
of RV_A_H_E_{m,2}, its use in steganography is unique.

Rather than use a sophisticated scheme, to demonstrate a prima-facie
case, the Least Significant Bit (LSB) substitution approach for image
"carriers" has been used. This technique consists of replacing the least
significant bit of the $j_k$-th pixel (or, to be more specific, the byte
that represents it) by the $k$-th bit of the output sequence. The set of
indices, $\{j_1, \ldots, j_R\}$, where R is the size of the output
($R \le |I|$, and $|I|$ is the size of the carrier image), can be chosen
in various ways, including the so-called pseudo-random permutations.
Details of this technique can be found in Katzenbeisser et al.
(Information Hiding Techniques for Steganography and Digital
Watermarking, Artech House, (2000)).
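
The LSB substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions: the carrier is a flat sequence of 8-bit gray-scale pixel values, the index set is drawn with Python's `random` module from a shared integer seed, and the names `embed_lsb` and `extract_lsb` are hypothetical. It is not the scheme used in the prototype.

```python
import random

def embed_lsb(pixels: bytes, bits, seed: int) -> bytes:
    """Write bits[k] into the least significant bit of pixel j_k."""
    if len(bits) > len(pixels):
        raise ValueError("the carrier is too small for the payload")
    # Pseudo-random choice of R distinct pixel indices j_1, ..., j_R.
    indices = random.Random(seed).sample(range(len(pixels)), len(bits))
    stego = bytearray(pixels)
    for k, j in enumerate(indices):
        stego[j] = (stego[j] & 0xFE) | bits[k]
    return bytes(stego)

def extract_lsb(pixels: bytes, count: int, seed: int):
    """Recover the first `count` embedded bits using the same seed."""
    indices = random.Random(seed).sample(range(len(pixels)), count)
    return [pixels[j] & 1 for j in indices]
```

Each pixel changes by at most one gray level, which is why the carrier and the resulting image remain visually similar.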

Although this is a fairly simplistic approach to applying RV_A_H_E_{m,2}
in steganography, it has been deliberately used so as to demonstrate its
power. An example will clarify the issue. In an experimental prototype,
the well known "Lena" image has been used as the carrier file, which
carries the output of RV_A_H_E_{m,2}. The original Lena image and the
image resulting from embedding the output obtained from encrypting
the file fields.c of the Canterbury corpus are shown in Figure 23. Apart
from the two images being visually similar, their respective histograms
pass the similarity test with a very high level of confidence. Such
results are typical.

Observe that once the output of RV_A_H_E_{m,2} has been obtained, it
can be easily embedded in the carrier by the most sophisticated
steganography tools (see Information Hiding Techniques for Steganography
and Digital Watermarking, Artech House, (2000)) currently available, and
need not be embedded by the simplistic LSB pseudo-random permutation
scheme described above.

The present invention has been described herein with regard to
preferred embodiments. However, it will be obvious to persons skilled in
the art that a number of variations and modifications can be made without
departing from the scope of the invention as described herein.

Appendix A

Formal Statement of the DODE Open Problem 

Assume that a code alphabet $\mathcal{A} = \{a_1, \ldots, a_r\}$ is given,
and that the user specifies the desired output probabilities (or
frequencies) of each $a_j \in \mathcal{A}$ in the compressed file by
$\mathcal{F}^* = \{f_1^*, \ldots, f_r^*\}$. Such a rendering will be
called an "entropy optimal" rendering. On the other hand, the user also
simultaneously requires optimal, or even sub-optimal, lossless data
compression. Thus, if the decompression process is invoked on a
compressed file, the file recovered must be exactly the same as the
original uncompressed file.

As stated in Hankerson et al. (Introduction to Information Theory
and Data Compression, CRC Press, (1998) pp. 75-79), the Distribution
Optimizing Data Compression (DODC) problem⁷ can be more formally
written as follows:

Problem 1. Consider the source alphabet, $\mathcal{S} = \{s_1, \ldots, s_m\}$,
with probabilities of occurrence, $\mathcal{P} = [p_1, \ldots, p_m]$, where
the input sequence is $\mathcal{X} = x[1] \ldots x[M]$, the code alphabet,
$\mathcal{A} = \{a_1, \ldots, a_r\}$, and the optimal frequencies,
$\mathcal{F}^* = \{f_1^*, \ldots, f_r^*\}$, of each $a_j$, $j = 1, \ldots, r$,
required in the output sequence. The output is to be a sequence,
$\mathcal{Y} = y[1] \ldots y[R]$, whose probabilities are
$\mathcal{F} = \{f_1, \ldots, f_r\}$, generated by the encoding scheme
$\phi : \mathcal{S} \rightarrow \mathcal{C} = \{w_1, \ldots, w_m\}$, where
$\ell_i$ is the length of $w_i$, such that:

(i) $\mathcal{C}$ is prefix,

(ii) the average code word length of $\phi$,

$$\bar{\ell} = \sum_{i=1}^{m} p_i \ell_i \qquad (13)$$

is minimal, and

(iii) the distance

$$d(\mathcal{F}, \mathcal{F}^*) = \sum_{j=1}^{r} |f_j - f_j^*|^{\Theta} \qquad (14)$$

is minimal, where $\Theta \geq 1$ is a real number.

⁷ In general, outside the context of compression, this problem is referred
to as the Distribution Optimizing Data Encoding (DODE) problem.



The Distribution Optimizing Data Compression (DODC) problem
involves developing an encoding scheme in which
$d(\mathcal{F}, \mathcal{F}^*)$ is arbitrarily close to zero.

Given the source alphabet, the probabilities of occurrence of the source
symbols, and the encoding scheme, each $f_j$ can be calculated without
encoding $\mathcal{X}$, as stated in the following formula for
$j = 1, \ldots, r$:

$$f_j = \frac{\sum_{i=1}^{m} u_{ji}\, p_i}{\sum_{i=1}^{m} p_i \ell_i} \qquad (15)$$

where $a_j$ occurs $u_{ji}$ times in $w_i$, for $i = 1, \ldots, m$ and
$j = 1, \ldots, r$.

It is interesting to note that the problem is intrinsically difficult
because the data compression requirements and the output probability
requirements can be contradictory. Informally speaking, if the occurrence
of '0' is significantly lower than the occurrence of '1' in the
uncompressed file, designing a data compression method which compresses
the file and which simultaneously increases the proportion of '0's to
'1's is far from trivial.
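
To make the quantities in Problem 1 concrete, the sketch below evaluates the output frequencies and the distance (14) for a given prefix code without encoding any data. It assumes that each $f_j$ is the expected number of occurrences of $a_j$ per source symbol divided by the average code word length, which is the reading of (15) adopted here; the names `output_frequencies` and `distance` are hypothetical.

```python
def output_frequencies(probs, codewords, alphabet="01"):
    """f_j = sum_i u_ji * p_i / sum_i p_i * l_i for each code symbol a_j."""
    avg_len = sum(p * len(w) for p, w in zip(probs, codewords))
    return [sum(p * w.count(a) for p, w in zip(probs, codewords)) / avg_len
            for a in alphabet]

def distance(freqs, targets, theta=2.0):
    """Distance (14) between the obtained and the requested frequencies."""
    return sum(abs(f - t) ** theta for f, t in zip(freqs, targets))

# A source with probabilities [0.75, 0.25] coded as '0' and '1'.
freqs = output_frequencies([0.75, 0.25], ["0", "1"])
print(freqs, distance(freqs, [0.5, 0.5]))  # [0.75, 0.25] 0.125
```

The example shows the tension described above: the shortest code for this source reproduces the skewed input distribution, so its distance from the requested [0.5, 0.5] stays well above zero.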

The importance of the problem is reflected in the following paragraph
as originally stated in Hankerson et al. (Introduction to Information
Theory and Data Compression, CRC Press, (1998)):

"It would be nice to have a slick algorithm to solve Problem 1,
especially in the case r = 2, when the output will not vary with different
definitions of $d(\mathcal{F}, \mathcal{F}^*)$. Also, the case r = 2 is
distinguished by the fact that binary channels are in widespread use in
the real world.

We have no such algorithm! Perhaps someone reading this will supply
one some day..."

Appendix B

Proof of Theorems 

Theorem 1 (Convergence of Process D_S_H_E_{2,2}). Consider a
stationary, memoryless source with alphabet $\mathcal{S} = \{0, 1\}$, whose
probabilities are $\mathcal{P} = [p, 1 - p]$, where $p \geq 0.5$, and the
code alphabet, $\mathcal{A} = \{0, 1\}$. If the source sequence
$\mathcal{X} = x[1], \ldots, x[n], \ldots$, with $x[i] \in \mathcal{S}$,
$i = 1, \ldots, n, \ldots$, is encoded using the Process D_S_H_E_{2,2} so as
to yield the output sequence $\mathcal{Y} = y[1], \ldots, y[n], \ldots$,
such that $y[i] \in \mathcal{A}$, $i = 1, \ldots, n, \ldots$, then

$$\lim_{n \to \infty} \Pr[f(n) = f^*] = 1, \qquad (16)$$

where $f^*$ is the requested probability of 0 in the output
($1 - p \leq f^* \leq p$), and $f(n) = c_0(n)/n$, with $c_0(n)$ being the
number of 0's encoded up to time n.

Proof. By following the sequence of operations executed by the Process
D_S_H_E$_{2,2}$, and using the fact that the most likely symbol of
$\mathcal{S}$ is 0, $y[n+1]$ is encoded as follows:

\[ y[n+1] = \begin{cases}
0 & \text{if } x[n+1] = 0 \text{ and } f(n) \le f^* \text{, or } x[n+1] = 1 \text{ and } f(n) > f^* \\
1 & \text{if } x[n+1] = 0 \text{ and } f(n) > f^* \text{, or } x[n+1] = 1 \text{ and } f(n) \le f^* ,
\end{cases} \tag{17} \]

where $f(n)$ is the estimated probability of 0 in the output, obtained
as the ratio between the number of 0's in the output and $n$ as follows:

\[ f(n) = \frac{c_0(n)}{n} . \]
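The encoding rule above can be exercised on simulated data. The sketch
below is an illustration only, not the patented process itself: it keeps
the source bit while the running fraction of zeros is at or below the
target and flips it otherwise. The function names, the choice of
f(0) = 0, and the matching decoder (which replays the same running
estimate from the bits already decoded) are assumptions made for the
example.

```python
import random

def encode_dshe22(source_bits, f_star):
    """Keep x[n+1] when f(n) <= f*, flip it when f(n) > f*."""
    out, zeros = [], 0
    for n, x in enumerate(source_bits):
        # f(0) is not defined by the rule; taking it as 0 is an assumption.
        f_n = zeros / n if n else 0.0
        y = x if f_n <= f_star else 1 - x
        out.append(y)
        zeros += (y == 0)
    return out

def decode_dshe22(encoded_bits, f_star):
    """Invert the rule by recomputing f(n) from the output seen so far."""
    src, zeros = [], 0
    for n, y in enumerate(encoded_bits):
        f_n = zeros / n if n else 0.0
        src.append(y if f_n <= f_star else 1 - y)
        zeros += (y == 0)
    return src

rng = random.Random(1)
p = 0.7  # hypothetical probability of 0 in the source
source = [0 if rng.random() < p else 1 for _ in range(100_000)]
encoded = encode_dshe22(source, f_star=0.5)
zero_fraction = encoded.count(0) / len(encoded)
```

On such a run the fraction of zeros in `encoded` settles near the
requested value, while the source itself stays near `p`.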

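For ease of notation, let $f(n+1)|_{f(n)}$ denote the conditional
probability $\Pr[y[n+1] = 0 \mid f(n)]$, which is, indeed, the
probability of obtaining 0 in the output at time $n+1$ given $f(n)$.
Using (17) and the fact that $f(n)$ is given, $f(n+1)|_{f(n)}$ is
calculated as follows:

\[ f(n+1)|_{f(n)} = \begin{cases}
\frac{c_0(n)+1}{n+1} & \text{if } x[n+1] = 0 \text{ and } f(n) \le f^* \text{, or } x[n+1] = 1 \text{ and } f(n) > f^* \\
\frac{c_0(n)}{n+1} & \text{if } x[n+1] = 0 \text{ and } f(n) > f^* \text{, or } x[n+1] = 1 \text{ and } f(n) \le f^* ,
\end{cases} \tag{18} \]

where $c_0(n)$ is the number of 0's in the output at time $n$.

Since $f(n) = \frac{c_0(n)}{n}$, it implies that $c_0(n) = n f(n)$.
Hence, (18) can be rewritten as follows:

\[ f(n+1)|_{f(n)} = \begin{cases}
\frac{n f(n)+1}{n+1} & \text{if } x[n+1] = 0 \text{ and } f(n) \le f^* \quad \text{(a)} \\
\frac{n f(n)+1}{n+1} & \text{if } x[n+1] = 1 \text{ and } f(n) > f^* \quad \text{(b)} \\
\frac{n f(n)}{n+1} & \text{if } x[n+1] = 0 \text{ and } f(n) > f^* \quad \text{(c)} \\
\frac{n f(n)}{n+1} & \text{if } x[n+1] = 1 \text{ and } f(n) \le f^* \quad \text{(d)}
\end{cases} \tag{19} \]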



Since (16) has to be satisfied, the behavior of (19) is analyzed for an
arbitrarily large value of $n$. To be more specific, the analysis
consists of considering the convergence of $f(n+1)|_{f(n)}$ as
$n \to \infty$. This is accomplished by considering three mutually
exclusive and exhaustive cases for $f(n)$.

The first one deals with the situation in which $f(n) = f^*$. As
$n \to \infty$, it is shown that $f^*$ is a fixed point. That is,
$f(n+1)|_{f(n)=f^*}$ is increased by an increment whose limit (for
$n \to \infty$) is 0, and hence $f(n+1)|_{f(n)=f^*}$ will stay at a
fixed point, $f^*$.

The second case considers the scenario when $f(n) < f^*$. The Process
D_S_H_E$_{2,2}$ increments $f(n+1)|_{f(n)<f^*}$ with a value
proportional to the difference $p - f(n)$, causing $f(n)$ to rapidly
converge towards $f^*$ in the direction of $p$. As it approaches $p$
from $f(n)$, it has to necessarily cross the value $f^*$, since
$f(n) < f^*$. It will be shown that this is done in a finite number of
steps, $k$, for which $f(n+k)|_{f(n)<f^*} \ge f^*$.

On the other hand, when $f(n) > f^*$, $f(n+1)|_{f(n)>f^*}$ decreases.
This decrement is by an amount proportional to the difference
$f(n) - (1-p)$. This, in turn, causes $f(n)$ to rapidly converge towards
$1-p$, and as before, it intersects the value $f^*$.

While the above quantities are proportional to $p - f(n)$ and
$f(n) - (1-p)$, respectively, it will be shown that they are both
inversely proportional to $n+1$.

These two convergence phenomena, namely those of the second and
third cases, cause $f(n+1)|_{f(n)}$ to become increasingly closer to
$f^*$ as $n$ grows large, until ultimately $f(n)$ is arbitrarily close
to $f^*$, whence the fixed point result of the first case forces $f(n)$
to remain arbitrarily close to $f^*$.

Observe that the combination of the convergence properties of the
three cases yields a strong convergence result, namely convergence with
probability one. The formal analysis for the three cases follows.

(i) $f(n) = f^*$.

The analysis of the asymptotic property of the probability of
$y[n+1] = 0$ given $f(n) = f^*$ is required. In this case, this is
calculated using (19.(a)) and (19.(d)), and the fact that
$\Pr[x[n+1] = 0] = p$, as follows:

\begin{align}
f(n+1)|_{f(n)=f^*} &= \frac{n f(n)+1}{n+1}\, p + \frac{n f(n)}{n+1}\,(1-p) \notag \\
&= \frac{n f(n) + p}{n+1} \notag \\
&= \frac{n f(n) + f(n)}{n+1} + \frac{p - f(n)}{n+1} \notag \\
&= f(n) + \frac{p - f(n)}{n+1} . \tag{20}
\end{align}

Since, in this case, $f(n) = f^*$, (20) results in:

\[ f(n+1)|_{f(n)=f^*} = f^* + \frac{p - f^*}{n+1} . \tag{21} \]

This implies that $f(n+1)|_{f(n)=f^*} \to f^*$ as $n \to \infty$, and
$f^*$ is a fixed point. Therefore, the following is true:

\[ \lim_{n \to \infty} \Pr[f(n+1)|_{f(n)=f^*} = f^*] = 1 . \tag{22} \]

Equation (22) is a very strong statement, since it implies that, as
$n \to \infty$, once $f(n)$ arrives at $f^*$, it stays at $f^*$ with
probability one.

(ii) $f(n) < f^*$.

In this case, the conditional probability of $y[n+1] = 0$ given
$f(n) < f^*$ is calculated using (20) as follows:

\[ f(n+1)|_{f(n)<f^*} = f(n) + \frac{p - f(n)}{n+1} . \tag{23} \]

Since $f(n) < f^*$ and $f^* < p$, by our hypothesis, it is clear that
$f(n) < p$. Hence:

\[ f(n+1)|_{f(n)<f^*} = f(n) + \Delta , \tag{24} \]

where $\Delta = \frac{p - f(n)}{n+1} > 0$ for all finite values of $n$.

Prior to analyzing the number of steps, $k$, required to ensure
that $f(n+k)|_{f(n)<f^*}$ becomes greater than or equal to $f^*$, a
closed-form expression for $f(n+k)|_{f(n)<f^*}$ is first derived. Since
(23) determines $f(n+1)|_{f(n)<f^*}$, it will be shown below that
$f(n+k)|_{f(n)<f^*}$ can be calculated as follows:

\[ f(n+k)|_{f(n)<f^*} = f(n) + \frac{k\,[p - f(n)]}{n+k} . \tag{25} \]

Equation (25) is proven by induction on $k$.

Basis step: Clearly, (25) is satisfied for $k = 1$ by the fact that
(23) is true.

Inductive hypothesis: Suppose that (25) is satisfied for any $j \ge 1$.
That is:

\[ f(n+j)|_{f(n)<f^*} = f(n) + \frac{j\,[p - f(n)]}{n+j} . \tag{26} \]

Inductive step: It is required to prove that (25) is true for $j+1$.
Substituting $n+j$ for $n$ in (23):

\[ f(n+j+1)|_{f(n)<f^*} = f(n+j) + \frac{p - f(n+j)}{n+j+1} . \tag{27} \]

Using the inductive hypothesis, (26), (27) can be rewritten as
follows:

\begin{align}
f(n+j+1)|_{f(n)<f^*} &= f(n) + \frac{j\,[p - f(n)]}{n+j} + \frac{p - f(n) - \frac{j\,[p - f(n)]}{n+j}}{n+j+1} \tag{28} \\
&= f(n) + \frac{jp}{n+j} - \frac{j f(n)}{n+j} + \frac{p}{n+j+1} - \frac{f(n)}{n+j+1} \notag \\
&\quad - \frac{jp}{(n+j)(n+j+1)} + \frac{j f(n)}{(n+j)(n+j+1)} \notag \\
&= f(n) + \frac{njp + j^2 p + (n+j)p}{(n+j)(n+j+1)} - \frac{nj f(n) + j^2 f(n) + (n+j) f(n)}{(n+j)(n+j+1)} \notag \\
&= f(n) + \frac{jp + p}{n+j+1} - \frac{j f(n) + f(n)}{n+j+1} \notag \\
&= f(n) + \frac{(j+1)\,[p - f(n)]}{n+j+1} . \tag{29}
\end{align}

Hence, (25) is satisfied for all $k \ge 1$.
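The closed form (25) can be checked numerically against repeated
application of the one-step update (23). The sketch below uses exact
rational arithmetic so that the comparison is not affected by rounding;
the function names and the sample values of f(n), n and p are
hypothetical, and the iteration assumes that the update of Case (ii)
applies at every step.

```python
from fractions import Fraction

def iterate_23(f_n, n, p, k):
    """Apply the one-step update (23) k times, starting from f(n) at time n."""
    f = f_n
    for i in range(k):
        f = f + (p - f) / (n + i + 1)
    return f

def closed_form_25(f_n, n, p, k):
    """Closed-form expression (25) for the value after k steps."""
    return f_n + k * (p - f_n) / (n + k)

f_n, n, p = Fraction(3, 10), 10, Fraction(7, 10)
pairs = [(iterate_23(f_n, n, p, k), closed_form_25(f_n, n, p, k))
         for k in range(1, 21)]
```

Every pair in `pairs` agrees exactly, which is the content of the
induction above for these sample values.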

On analyzing the conditions for $k$ so that
$f(n+k)|_{f(n)<f^*} \ge f^*$, it is seen that the following must be
true:

\[ f(n) + \frac{k\,[p - f(n)]}{n+k} \ge f^* . \tag{30} \]

Subtracting $f(n)$ from both sides:

\[ \frac{k\,[p - f(n)]}{n+k} \ge f^* - f(n) . \tag{31} \]

Multiplying both sides by $\frac{n+k}{k} > 0$ and dividing by
$f^* - f(n) > 0$, (31) results in:

\[ \frac{p - f(n)}{f^* - f(n)} \ge \frac{n+k}{k} = \frac{n}{k} + 1 . \tag{32} \]

Solving for $k$ yields:

\[ k \ge \frac{n}{\frac{p - f(n)}{f^* - f(n)} - 1} . \tag{33} \]

Hence, for a finite $n$, after $k$ steps ($k$ finite), (30) is
satisfied with probability one. More formally,

\[ \Pr[f(n+k)|_{f(n)<f^*} \ge f^*] = 1 , \quad k < \infty . \tag{34} \]

Observe that the convergence implied by (34) is very strong. It
implies that $f(n+k)|_{f(n)<f^*}$ will be above (or equal to) $f^*$ in
a finite number of steps, $k$, with probability one. Moreover, since
the increment is inversely proportional to $n+1$, $f(n)$ converges
quickly towards $f^*$ for small values of $n$, and will stay arbitrarily
close to $f^*$ as $n \to \infty$ with probability one.
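As a worked illustration of the bound (33), the smallest admissible k can
be computed for sample values and confirmed with the closed form (25).
This is a sketch under the assumptions f(n) < f* < p; the function names
and the numbers are hypothetical.

```python
from fractions import Fraction
from math import ceil

def min_steps_33(f_n, f_star, p, n):
    """Smallest integer k satisfying the bound (33); needs f(n) < f* < p."""
    ratio = (p - f_n) / (f_star - f_n)
    return ceil(n / (ratio - 1))

def closed_form_25(f_n, n, p, k):
    """Closed-form expression (25) for the value after k steps."""
    return f_n + k * (p - f_n) / (n + k)

f_n, f_star, p, n = Fraction(3, 10), Fraction(1, 2), Fraction(7, 10), 10
k = min_steps_33(f_n, f_star, p, n)
```

With these values, k steps are enough for the closed form to reach f*,
and one step fewer is not.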

Since $f(n+k)|_{f(n)<f^*}$ will sooner or later exceed $f^*$, the
discussion of this case is concluded by noting that when
$f(n+k)|_{f(n)<f^*} > f^*$, the quantity $f(n+k+1)|_{f(n)<f^*}$ has to
be calculated in a different way, since the hypothesis of Case (ii) is
violated. Indeed, in this scenario, the hypothesis of Case (iii) is
satisfied, and the following analysis comes into effect. Finally, if
$f(n+k)|_{f(n)<f^*} = f^*$, the quantity $f(n+k+1)|_{f(n)<f^*}$ is
calculated using (21) as in Case (i), converging to the fixed point,
$f^*$, as $n \to \infty$.

(iii) $f(n) > f^*$.

In this case, the conditional probability of $y[n+1] = 0$ given
$f(n) > f^*$ is calculated using (19.(b)) and (19.(c)), and the fact
that $\Pr[x[n+1] = 0] = p$, as follows:

\begin{align}
f(n+1)|_{f(n)>f^*} &= \frac{n f(n)+1}{n+1}\,(1-p) + \frac{n f(n)}{n+1}\, p \notag \\
&= \frac{n f(n) + 1 - p}{n+1} \notag \\
&= \frac{n f(n) + f(n)}{n+1} + \frac{(1-p) - f(n)}{n+1} \notag \\
&= f(n) - \frac{f(n) - (1-p)}{n+1} . \tag{35}
\end{align}

Since $f(n) > f^*$ and $f^* > 1-p$, it means that $f(n) > 1-p$,
which implies that $f(n) - (1-p) > 0$. Thus:

\[ f(n+1)|_{f(n)>f^*} = f(n) - \Delta , \tag{36} \]

where $\Delta = \frac{f(n) - (1-p)}{n+1} > 0$ for $n$ finite. Hence, as
$n$ grows, $f(n)$ decreases by a positive value, $\Delta$.

A closed-form expression for $f(n+k)|_{f(n)>f^*}$ is now derived.
Indeed, it is shown by induction that if $f(n+1)|_{f(n)>f^*}$ is
calculated as in (35), $f(n+k)|_{f(n)>f^*}$ can be calculated as follows
($k \ge 1$):

\[ f(n+k)|_{f(n)>f^*} = f(n) - \frac{k\,[f(n) - (1-p)]}{n+k} . \tag{37} \]

Basis step: Equation (37) is clearly satisfied for $k = 1$, by virtue of
the fact that (35) holds. Hence the basis step.

Inductive hypothesis: Assume that (37) is true for $j \ge 1$. That is:

\[ f(n+j)|_{f(n)>f^*} = f(n) - \frac{j\,[f(n) - (1-p)]}{n+j} . \tag{38} \]

Inductive step: It is shown that (37) is satisfied for $j+1$.
Substituting $n+j$ for $n$ in (35) results in:

\[ f(n+j+1)|_{f(n)>f^*} = f(n+j) - \frac{f(n+j) - (1-p)}{n+j+1} . \tag{39} \]

Using the inductive hypothesis, (38), (39) can be rewritten as
follows [8]:

\begin{align}
f(n+j+1)|_{f(n)>f^*} &= f(n) - \frac{j\,[f(n) - (1-p)]}{n+j} - \frac{f(n) - \frac{j\,[f(n) - (1-p)]}{n+j} - (1-p)}{n+j+1} \tag{40} \\
&= f(n) - \frac{j f(n)}{n+j} + \frac{j(1-p)}{n+j} - \frac{f(n)}{n+j+1} + \frac{1-p}{n+j+1} \notag \\
&\quad + \frac{j f(n)}{(n+j)(n+j+1)} - \frac{j(1-p)}{(n+j)(n+j+1)} \notag \\
&= f(n) - \frac{nj f(n) + j^2 f(n) + (n+j) f(n)}{(n+j)(n+j+1)} + \frac{nj(1-p) + j^2(1-p) + (n+j)(1-p)}{(n+j)(n+j+1)} \notag \\
&= f(n) - \frac{j f(n) + f(n)}{n+j+1} + \frac{j(1-p) + (1-p)}{n+j+1} \notag \\
&= f(n) - \frac{(j+1)\,[f(n) - (1-p)]}{n+j+1} . \tag{41}
\end{align}

[8] The algebra for this case mirrors the algebra of Case (ii). It is
not identical, though, since $p$ and $1-p$ interchange places, and the
sign for $f(n)$ changes from being positive to negative in various
places. The proof is included in the interest of completeness.

Therefore, (37) is satisfied for all $k \ge 1$.

Observe now that $f(n)$ monotonically decreases until it crosses $f^*$.
To satisfy $f(n+k)|_{f(n)>f^*} \le f^*$, the following must be true:

\[ f(n) - \frac{k\,[f(n) - (1-p)]}{n+k} \le f^* . \tag{42} \]

Subtracting $f(n)$, and reversing the inequality, yields:

\[ f(n) - f^* \le \frac{k}{n+k}\,[f(n) - (1-p)] . \tag{43} \]

Multiplying both sides by $\frac{n+k}{k} > 0$ and dividing by
$f(n) - f^* > 0$, (43) results in:

\[ \frac{f(n) - (1-p)}{f(n) - f^*} \ge \frac{n+k}{k} = \frac{n}{k} + 1 . \tag{44} \]

As before, solving for $k$ yields:

\[ k \ge \frac{n}{\frac{f(n) - (1-p)}{f(n) - f^*} - 1} . \tag{45} \]

Thus, for a finite $n$, after $k$ steps ($k$ finite), (42) is satisfied
with probability one. That is:

\[ \Pr[f(n+k)|_{f(n)>f^*} \le f^*] = 1 . \tag{46} \]

As in the case of (34), equation (46) represents a strong convergence
result. It implies that, starting from $f(n) > f^*$,
$f(n+k)|_{f(n)>f^*}$ will be below (or equal to) $f^*$ in a finite
number of steps, $k$, with probability one. Moreover, since the
decrement is inversely proportional to $n+1$, $f(n)$ converges quickly
towards $f^*$ for small values of $n$, and will stay arbitrarily close
to $f^*$ as $n \to \infty$ with probability one.

Since $f(n+k)|_{f(n)>f^*}$ will sooner or later become less than or
equal to $f^*$, this case is terminated by noting that when
$f(n+k)|_{f(n)>f^*} < f^*$, the quantity $f(n+k+1)|_{f(n)>f^*}$ has to
be calculated in a different way, since the hypothesis of Case (iii) is
violated. At this juncture, the hypothesis of Case (ii) is satisfied,
and the previous analysis of Case (ii) comes into effect. Clearly, if
$f(n+k)|_{f(n)>f^*} = f^*$, the quantity $f(n+k+1)|_{f(n)>f^*}$ is
calculated using (21) as in Case (i), converging to the fixed point,
$f^*$, as $n \to \infty$.

To complete the proof, the analyses of the three cases are combined
to show that the total probability $f(\cdot)$ converges to $f^*$.
From (34), there exists $m > n$ such that

\[ \Pr[f(m)|_{f(n)<f^*} \ge f^*] = 1 . \tag{47} \]

This means that $f(m)|_{f(n)<f^*} \ge f^*$, but
$f(m+1)|_{f(n)<f^*} < f(m)|_{f(n)<f^*}$, since $f(m)|_{f(n)<f^*}$ obeys
(35). Equation (35) yields:

\[ f(m)|_{f(n)<f^*} = f^* \pm \delta , \tag{48} \]

where $\delta \to 0$ as $n \to \infty$. Hence, there exists $n_0 > m$,
and $\epsilon$ arbitrarily small, such that:

\[ \lim_{n_0 \to \infty} \Pr[\,|f(n_0)|_{f(n)<f^*} - f^*| < \epsilon\,] = 1 . \tag{49} \]

In the contrary scenario of (46), it is known that there exists $m > n$
such that:

\[ \Pr[f(m)|_{f(n)>f^*} \le f^*] = 1 . \tag{50} \]

Hence $f(m)|_{f(n)>f^*} \le f^*$, but
$f(m+1)|_{f(n)>f^*} > f(m)|_{f(n)>f^*}$. From (23), it can be seen that:

\[ f(m)|_{f(n)>f^*} = f^* \pm \delta , \tag{51} \]

where $\delta \to 0$ as $n \to \infty$. Again, there exists $n_0 > m$,
and $\epsilon$ arbitrarily small, such that:

\[ \lim_{n_0 \to \infty} \Pr[\,|f(n_0)|_{f(n)>f^*} - f^*| < \epsilon\,] = 1 . \tag{52} \]

Combining (22), (49), and (52) yields:

\begin{align}
\lim_{t \to \infty} \Pr[f(t)|_{f(n)=f^*} = f^*] &= 1 , \tag{53} \\
\lim_{t \to \infty} \Pr[f(t)|_{f(n)<f^*} = f^*] &= 1 , \text{ and} \tag{54} \\
\lim_{t \to \infty} \Pr[f(t)|_{f(n)>f^*} = f^*] &= 1 . \tag{55}
\end{align}

From the three cases discussed above, the total probability of
$f(t) = f^*$ is evaluated as:

\begin{align}
\Pr[f(t) = f^*] &= \Pr[f(t)|_{f(n)=f^*} = f^*]\, \Pr[f(n) = f^*] \notag \\
&\quad + \Pr[f(t)|_{f(n)<f^*} = f^*]\, \Pr[f(n) < f^*] \notag \\
&\quad + \Pr[f(t)|_{f(n)>f^*} = f^*]\, \Pr[f(n) > f^*] . \tag{56}
\end{align}

Since the conditioning events are mutually exclusive and collectively
exhaustive, (53), (54), (55), and (56) are combined to yield:

\[ \lim_{t \to \infty} \Pr[f(t) = f^*] = 1 , \tag{57} \]

and the result is proven. □

Theorem 2 (Rate of Convergence of Process D_S_H_E$_{2,2}$). If $f^*$ is
set at 0.5, then $E[f(1)] = 0.5$ for Process D_S_H_E$_{2,2}$, implying a
one-step convergence in the expected value.

Proof. From (23) it can be seen that $f(n+1)$ has the value given below

\[ f(n+1)|_{f(n)<f^*} = f(n) + \frac{p - f(n)}{n+1} \tag{58} \]

whenever $f(n) < f^*$.

But independent of the value set for $f(0)$ in the initialization
process, this quantity cancels in the computation of $f(1)$, leading to
the simple expression (obtained by setting $n = 0$) that

$f(1) = p$ whenever $f(0) < 0.5$.

Similarly, using expression (35), it is clear that $f(1)$ has the value

$1 - p$ whenever $f(0) > 0.5$.

Initializing the value of $f(0)$ to be less than $f^*$ with probability
$s$ leads to:

\[ f(1) = sp\,f(0) + sp\,[1 - f(0)] , \]

which can be expressed in its matrix form as follows:

\[ \begin{bmatrix} f(1) \\ 1 - f(1) \end{bmatrix} =
\begin{bmatrix} sp & s(1-p) \\ sp & s(1-p) \end{bmatrix}^{T}
\begin{bmatrix} f(0) \\ 1 - f(0) \end{bmatrix} . \tag{59} \]

The above, of course, implies that the value of $f(0)$ can be
initialized to be greater than $f^*$ with probability $1 - s$. Thus,

\[ f(1) = (1-s)(1-p)\,f(0) + (1-s)p\,[1 - f(0)] , \]

which can also be expressed using the matrix form as follows:

\[ \begin{bmatrix} f(1) \\ 1 - f(1) \end{bmatrix} =
\begin{bmatrix} (1-s)(1-p) & (1-s)p \\ (1-s)(1-p) & (1-s)p \end{bmatrix}^{T}
\begin{bmatrix} f(0) \\ 1 - f(0) \end{bmatrix} . \tag{60} \]

From (59), (60), and using $\mathcal{F}(\cdot)$ to denote the vector
$[f(\cdot), 1 - f(\cdot)]$, the expected value of $f(1)$ and $1 - f(1)$
becomes:

\[ E[\mathcal{F}(1)] = M^{T} E[\mathcal{F}(0)] , \tag{61} \]

where $M$ is the Markov transition matrix given by:

\[ M = \begin{bmatrix}
sp + (1-s)(1-p) & s(1-p) + (1-s)p \\
sp + (1-s)(1-p) & s(1-p) + (1-s)p
\end{bmatrix} . \tag{62} \]

Calculating the eigenvalues of $M$ yields $\lambda_1 = 1$ and
$\lambda_2 = 0$, and hence the Markov chain converges in a single step!
Therefore, if $f(0)$ is uniformly initialized with probability
$s = 0.5$, $E[f(1)]$ attains the value of $f^* = 0.5$ in a single
step. □
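The eigenvalue claim for the matrix M of (62) can be verified with a few
lines of arithmetic. In the sketch below (function names and sample
values are hypothetical), the trace and determinant of the 2x2 matrix
are computed exactly; the eigenvalues are the roots of
lambda^2 - trace * lambda + det = 0, so a trace of 1 and a determinant
of 0 give the eigenvalues 1 and 0.

```python
from fractions import Fraction

def transition_matrix(s, p):
    """The matrix M of (62): both rows are identical."""
    a = s * p + (1 - s) * (1 - p)
    b = s * (1 - p) + (1 - s) * p
    return [[a, b], [a, b]]

def trace_and_det(M):
    """Trace and determinant of a 2x2 matrix."""
    return M[0][0] + M[1][1], M[0][0] * M[1][1] - M[0][1] * M[1][0]

def expected_f1(s, p, f0):
    """First component of M^T [f(0), 1 - f(0)]^T, as in (61)."""
    M = transition_matrix(s, p)
    return M[0][0] * f0 + M[1][0] * (1 - f0)

tr, det = trace_and_det(transition_matrix(Fraction(1, 3), Fraction(7, 10)))
e_f1 = expected_f1(Fraction(1, 2), Fraction(7, 10), Fraction(1, 5))
```

With s = 0.5 the expected value is 0.5 whatever f(0) and p are, in line
with the one-step convergence stated by the theorem.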

Theorem 3 (Convergence of Process D_S_H_E$_{m,2}$). Consider a
stationary, memoryless source with alphabet
$\mathcal{S} = \{s_1, \ldots, s_m\}$ whose probabilities are
$\mathcal{P} = [p_1, \ldots, p_m]$, the code alphabet
$\mathcal{A} = \{0, 1\}$, and a binary Huffman tree, $\mathcal{T}$,
constructed using Huffman's process. If the source sequence
$\mathcal{X} = x[1] \ldots x[M]$ is encoded by means of the Process
D_S_H_E$_{m,2}$ and $\mathcal{T}$, generating the output sequence
$\mathcal{Y} = y[1] \ldots y[R]$, then

\[ \lim_{n \to \infty} \Pr[f(n) = f^*] = 1 , \tag{63} \]

where $f^*$ is the requested probability of 0 in the output
($1 - f_{max} < f^* < f_{max}$), and $f(n) = \frac{c_0(n)}{n}$ with
$c_0(n)$ being the number of 0's encoded up to time $n$.

Proof. Assume that the maximum number of levels in $\mathcal{T}$ is
$j$. Since different symbols coming from the input sequence are being
encoded, it is clear that the level $j$ is not reached for every symbol.
Observe that the first node of the tree is reached for every symbol
coming from $\mathcal{X}$, which is obvious, since every path starts
from the root of $\mathcal{T}$. Using this fact the basis step is
proved, and assuming that the fixed point, $f^*$, is achieved at every
level (up to $j - 1$) of $\mathcal{T}$, the inductive step that $f^*$ is
also attained at level $j$ can be proven.

Basis step: It has to be proved that the fixed point $f^*$ is achieved
at the first level of $\mathcal{T}$. Note that the root node has the
associated probability equal to unity, and hence its left child has some
probability $p_1 > 0.5$ (because of the sibling property), and the right
child has probability $1 - p_1$. This is exactly a Huffman tree with
three nodes as in Theorem 1.

Depending on the condition $f(n) < f^*$, the value of $f(n)$ will be
increased (Case (ii) of Theorem 1) or decreased (Case (iii) of
Theorem 1) by $\Delta_1$ as given in (24) or (36) respectively. Whenever
$f(n)$ asymptotically reaches the value of $f^*$, it will stay at this
fixed point (Case (i) of Theorem 1). By following the rigorous analysis
of Theorem 1, omitted here for the sake of brevity, it is clear that
$f(n)$ asymptotically attains the fixed point $f^*$ as $n \to \infty$,
w.p.1.

Inductive hypothesis: Assume that the fixed point $f^*$ is attained by
Process D_S_H_E$_{m,2}$ at level $j - 1 > 2$, for $n \to \infty$ with
probability unity.

Inductive step: It is to be shown that the fixed point $f^*$ is
asymptotically achieved at level $l = j$. Since the level $j$ may not be
reached for all the symbols coming from the input sequence, the decision
regarding the assignment 0-1 or 1-0 for these symbols must have been
taken at a level $k < j$. In this case, the argument is true for $j$,
because of the inductive hypothesis.

For the case in which the level $j$ is reached, the decision concerning
the branch assignment is to be taken in a subtree, $\mathcal{T}_j$,
whose root (at level $j$) has weight $w$, and its two leaves have
probabilities $p_{j_1}$ and $p_{j_2}$. After normalizing (i.e. dividing
by $w$), the root of $\mathcal{T}_j$ will have an associated normalized
probability 1, the left child has the associated normalized probability
$p_j = \frac{p_{j_1}}{w}$, and the right child has probability
$1 - p_j = \frac{p_{j_2}}{w}$, where $p_j > 0.5$ because of the sibling
property.

Since the decision on the branch assignment is being taken at the root
of $\mathcal{T}_j$, if $f(n) < f^*$, the value of $f(n)$ will be
increased by an amount $\Delta_j$ as given in (24). This corresponds to
Case (ii) of Theorem 1. On the other hand, if $f(n) > f^*$, $f(n)$ will
be decreased by $\Delta_j$ as given in (36), which corresponds to
Case (iii) of Theorem 1. Finally, if $f(n) = f^*$, $f(n)$ will
asymptotically stay at the fixed point $f^*$ (Case (i) of Theorem 1). As
in the basis step, by following the rigorous analysis of Theorem 1,
which is omitted here for the sake of brevity, it follows that $f(n)$
asymptotically achieves the fixed point $f^*$ as $n \to \infty$, w.p.1.

The induction follows and the theorem is proved. □
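The per-node argument of this proof can be illustrated with a small
simulation. The sketch below is inferred from the proof rather than
taken from the process definition: at every internal node of a fixed
tree, the heavier (left) branch is labeled 0 when f(n) <= f* and 1
otherwise. The tree, the symbol probabilities, the function names and
the choice f(0) = 0 are all hypothetical.

```python
import random

# Hypothetical Huffman tree for probabilities a: 0.5, b: 0.3, c: 0.2.
# An internal node is a (left, right) pair with the heavier child on the left.
TREE = ("a", ("b", "c"))

def branch_path(tree, symbol):
    """Return the choices (0 = left, 1 = right) from the root to symbol."""
    if tree == symbol:
        return []
    if isinstance(tree, tuple):
        for side, child in enumerate(tree):
            tail = branch_path(child, symbol)
            if tail is not None:
                return [side] + tail
    return None

def encode(symbols, f_star):
    """Emit one bit per tree level, choosing the branch labels from f(n)."""
    bits, zeros = [], 0
    for s in symbols:
        for side in branch_path(TREE, s):
            f_n = zeros / len(bits) if bits else 0.0
            heavy_label = 0 if f_n <= f_star else 1
            bit = heavy_label if side == 0 else 1 - heavy_label
            bits.append(bit)
            zeros += (bit == 0)
    return bits

rng = random.Random(7)
symbols = rng.choices("abc", weights=[0.5, 0.3, 0.2], k=50_000)
bits = encode(symbols, f_star=0.5)
zero_fraction = bits.count(0) / len(bits)
```

In this example the root splits the probability evenly, so the pull
towards f* comes from the second level, where the heavier branch has
normalized probability 0.6.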

Theorem 4 (Convergence of Process D_A_H_E$_{2,2}$). Consider a
memoryless source with alphabet $\mathcal{S} = \{0, 1\}$ and the code
alphabet, $\mathcal{A} = \{0, 1\}$. If the source sequence
$\mathcal{X} = x[1], \ldots, x[n], \ldots$, with $x[i] \in \mathcal{S}$,
$i = 1, \ldots, n, \ldots$, is encoded using the Process D_A_H_E$_{2,2}$
so as to yield the output sequence
$\mathcal{Y} = y[1], \ldots, y[n], \ldots$, such that
$y[i] \in \mathcal{A}$, $i = 1, \ldots, n, \ldots$, then

\[ \lim_{n \to \infty} \Pr[f(n) = f^*] = 1 , \tag{64} \]

where $f^*$ is the requested probability of 0 in the output, and
$f(n) = \frac{c_0(n)}{n}$ with $c_0(n)$ being the number of 0's encoded
up to time $n$.

Proof. It is to be shown that the fixed point $f^*$ is asymptotically
achieved by Process D_A_H_E$_{2,2}$ as $n \to \infty$, with probability
unity. Observe that at each time instant, Process D_A_H_E$_{2,2}$
maintains a Huffman tree $\mathcal{T}_n$, which has the sibling
property. $\mathcal{T}_n$ has the root, whose probability or weight is
unity, the left child with probability $p(n)$, and the right child whose
probability is $1 - p(n)$, where $p(n) > 0.5$ because of the sibling
property.

The details of the proof follow the exact same lines as the proof of
Theorem 1. They are not repeated here. However, it is clear that at
each time instant '$n$', there are three cases for $f(n)$:

(i) $f(n) = f^*$: In this case, as $n$ increases indefinitely, $f(n)$
will stay at the fixed point $f^*$, by invoking Case (i) of Theorem 1,
where the probability of 0 in the input at time '$n$' is $p(n)$.

(ii) $f(n) < f^*$: In this case, the value of $f(n)$ will be increased
by $\Delta$, as given in (24). This corresponds to Case (ii) of
Theorem 1.

(iii) $f(n) > f^*$: In this case, the value of $f(n)$ will be decreased
by $\Delta$, as given in (36). This corresponds to Case (iii) of
Theorem 1.

With these three cases, and mirroring the same rigorous analysis done
in Theorem 1, one can conclude that the estimated probability of 0 in
the output, $f(n)$, converges to the fixed point $f^*$ as
$n \to \infty$. This happens w.p.1, as stated in (64). □

Theorem 5 (Convergence of Process D_A_H_E$_{m,2}$). Consider a
memoryless source with alphabet $\mathcal{S} = \{s_1, \ldots, s_m\}$
whose probabilities are $\mathcal{P} = [p_1, \ldots, p_m]$, and the code
alphabet $\mathcal{A} = \{0, 1\}$. If the source sequence
$\mathcal{X} = x[1] \ldots x[M]$ is encoded by means of the Process
D_A_H_E$_{m,2}$, generating the output sequence
$\mathcal{Y} = y[1] \ldots y[R]$, then

\[ \lim_{n \to \infty} \Pr[f(n) = f^*] = 1 , \tag{65} \]

where $f^*$ is the requested probability of 0 in the output
($1 - f_{max} < f^* < f_{max}$), and $f(n) = \frac{c_0(n)}{n}$ with
$c_0(n)$ being the number of 0's encoded up to time $n$.

Proof. Considering an initial distribution for the symbols of
$\mathcal{S}$, a Huffman tree, $\mathcal{T}_0$, is created at the
beginning of Process D_A_H_E$_{m,2}$. As in Process D_S_H_E$_{m,2}$, it
is assumed that the maximum level of $\mathcal{T}_k$ is $l = j$. This
level will not be reached at all time instants. However, the first node
is reached for every symbol coming from $\mathcal{X}$, because every
path starts from the root of $\mathcal{T}_k$. Assuming that the fixed
point $f^*$ is achieved at every level (up to $j - 1$) of
$\mathcal{T}_k$, it is shown by induction that $f^*$ is also
asymptotically attained at level $j$.

Basis step: The root node of $\mathcal{T}_k$ has the normalized weight
of unity, its left child has the normalized weight $p(k)$, and hence the
right child has the normalized weight $1 - p(k)$, where $p(k) > 0.5$
because of the sibling property. There are again three cases for the
relation between $f(n)$ and $f^*$:

(i) Whenever $f(n) = f^*$, $f(n)$ will asymptotically stay at the fixed
point $f^*$ as $n \to \infty$.

(ii) If $f(n) < f^*$, $f(n)$ will be increased by $\Delta_1$, as in
Case (ii) of Theorem 4.

(iii) If $f(n) > f^*$, $f(n)$ will be decreased by $\Delta_1$
(Case (iii) of Theorem 4).

From these three cases and from the analysis done in Theorem 4, it
can be shown that $f(n)$ will converge to $f^*$ as $n \to \infty$
w.p.1, and hence (65) is satisfied.

Inductive hypothesis: Assume that the fixed point $f^*$ is attained
w.p.1 by D_A_H_E$_{m,2}$ at any level $j - 1 > 2$ of $\mathcal{T}_k$,
for $n \to \infty$.

Inductive step: In order to show that $f^*$ is asymptotically achieved
at level $l = j$, two cases are considered. The first case is when the
level $j$ is not reached by the symbol being encoded. In this case, the
labeling strategy was considered in a level $i < j$, for which the
result follows from the inductive hypothesis.

For the case in which the level $j$ is reached, the decision on the
labeling is being taken in a subtree, $\mathcal{T}_{k_j}$, whose root
has the estimated probability $w(k)$, and its two leaves have estimated
probabilities $p_{j_1}(k)$ and $p_{j_2}(k)$. After normalizing (i.e.
dividing by $w(k)$), the root of $\mathcal{T}_{k_j}$ has probability
unity, the left child has probability
$p_j(k) = \frac{p_{j_1}(k)}{w(k)}$, and the right child has probability
$1 - p_j(k) = \frac{p_{j_2}(k)}{w(k)}$, where $p_j(k) > 0.5$ because of
the sibling property.

Again, there are three cases for $f(n)$ for the local subtree. When
$f(n) = f^*$, $f(n)$ will asymptotically stay at the fixed point $f^*$.
Whenever $f(n) < f^*$, $f(n)$ will be increased by $\Delta_j(k)$,
calculated as in (24), which is positive, since it depends on
$p_j(k) > 0.5$. If $f(n) > f^*$, $f(n)$ will be decreased by
$\Delta_j(k)$ (again positive, because $p_j(k) > 0.5$) as given in (36).
From these three cases and the analysis done in Theorem 4, it follows in
an analogous manner that Process D_A_H_E$_{m,2}$ achieves the fixed
point $f^*$ as $n \to \infty$ w.p.1, as stated in (65). □

Theorem 6 (Convergence of Process RV_A_H_E$_{2,2}$). Consider a
memoryless source whose alphabet is $\mathcal{S} = \{0, 1\}$ and a code
alphabet, $\mathcal{A} = \{0, 1\}$. If the source sequence
$\mathcal{X} = x[1], \ldots, x[n], \ldots$, with $x[i] \in \mathcal{S}$,
$i = 1, \ldots, n, \ldots$, is encoded using the Process
RV_A_H_E$_{2,2}$ so as to yield the output sequence
$\mathcal{Y} = y[1], \ldots, y[n], \ldots$, such that
$y[i] \in \mathcal{A}$, $i = 1, \ldots, n, \ldots$, then

\begin{align}
\lim_{n \to \infty} E[f(n)] &= f^* , \text{ and} \tag{66} \\
\lim_{n \to \infty} \mathrm{Var}[f(n)] &= 0 , \tag{67}
\end{align}

where $f^* = 0.5$ is the requested probability of 0 in the output, and
$f(n) = \frac{c_0(n)}{n}$ with $c_0(n)$ being the number of 0's encoded
up to time $n$. Thus $f(n)$ converges to $f^*$ in the mean square sense,
and in probability.

Proof. The estimated probability of 0 in the output at time '$n+1$'
given the value of $f(n)$, $f(n+1)|_{f(n)}$, is calculated as follows:

\[ f(n+1)|_{f(n)} = \begin{cases}
\frac{c_0(n)+1}{n+1} & \text{if } y[n+1] = 0 \\
\frac{c_0(n)}{n+1} & \text{if } y[n+1] = 1 ,
\end{cases} \tag{68} \]

where $f(n)$ is the estimated probability of 0 in the output, obtained
as the ratio between the number of 0's in the output and $n$ as
$f(n) = \frac{c_0(n)}{n}$.

On the other hand, the output of Process RV_A_H_E$_{2,2}$ at time
$n+1$ is generated as per the following probabilities:

\[ y[n+1] = \begin{cases}
0 & \text{with probability } [1 - f(n)]\,p(n) + f(n)\,(1 - p(n)) \\
1 & \text{with probability } [1 - f(n)]\,(1 - p(n)) + f(n)\,p(n) ,
\end{cases} \tag{69} \]

where $p(n)$ is the estimated probability of 0 in the input at time
'$n$', and $p(n) > 0.5$ because of the sibling property.
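The output probabilities in (69) can be realized by flipping each source
bit with probability f(n) and keeping it otherwise, which gives a 0 with
probability [1 - f(n)]p + f(n)(1 - p). The sketch below simulates that
mechanism. It is an assumption made for illustration: the flipping
mechanism, the fixed source probability p (in place of the adaptive
estimate p(n)), the choice f(0) = 0.5 and the function name are not
taken from the process definition.

```python
import random

def encode_rv(source_bits, rng):
    """Flip each bit with probability f(n), the running fraction of zeros."""
    out, zeros = [], 0
    for n, x in enumerate(source_bits):
        # f(0) is undefined; starting at the target 0.5 is an assumption.
        f_n = zeros / n if n else 0.5
        y = 1 - x if rng.random() < f_n else x
        out.append(y)
        zeros += (y == 0)
    return out

rng = random.Random(3)
p = 0.7  # hypothetical fixed probability of 0 in the source
source = [0 if rng.random() < p else 1 for _ in range(100_000)]
encoded = encode_rv(source, rng)
zero_fraction = encoded.count(0) / len(encoded)
```

Setting the probability of a 0 equal to f(n) itself gives f(n) = 0.5 as
the only fixed point, which matches the target f* = 0.5 of the theorem.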

Unlike in the convergence analysis of Process D_S_H_E$_{2,2}$, in which
the exact value of $f(n+1)|_{f(n)}$ was calculated, here the expected
value of $f(n+1)|_{f(n)}$ is computed. For ease of notation, let
$\bar{f}(n+1)$ be the expected value of $f(n+1)$ given $f(n)$.
Replacing the value of $c_0(n)$ by $n f(n)$ in (68), and using (69), all
the quadratic terms disappear. Taking expectations again,
$\bar{f}(n+1)$ is evaluated as:

\begin{align}
\bar{f}(n+1) &= \frac{n f(n) + 1}{n+1}\,\{[1 - f(n)]\,p(n) + f(n)\,(1 - p(n))\} \notag \\
&\quad + \frac{n f(n)}{n+1}\,\{[1 - f(n)]\,(1 - p(n)) + f(n)\,p(n)\} . \tag{70}
\end{align}

After some algebraic manipulations, (70) can be written as:

\[ \bar{f}(n+1) = \bar{f}(n) + \frac{p(n)\,[1 - 2\bar{f}(n)]}{n+1} . \tag{71} \]

The convergence of (71) can be paralleled by the convergence results
of Theorem 1, with the exception that instead of speaking of $f(n)$ the
convergence of $E[f(n)]$ is investigated. This proof is achieved by
analyzing three mutually exclusive cases for $\bar{f}(n)$.

(i) $\bar{f}(n) = f^*$.

In this case, Equation (71) asymptotically leads to
$\bar{f}(n+1) = f^*$. This implies that if $\bar{f}(n)$ is at the fixed
point $f^*$, it will remain there as $n \to \infty$, and hence (66) is
satisfied.

(ii) $\bar{f}(n) < f^*$.

In this scenario, equation (71) can be written as:

\[ \bar{f}(n+1) = \bar{f}(n) + \Delta , \tag{72} \]

where $\Delta = \frac{p(n)\,[1 - 2\bar{f}(n)]}{n+1} > 0$, since
$\bar{f}(n) < f^*$.

Note that (72) is similar to (24), where $\Delta > 0$, again implying
that $\bar{f}(n)$ increases at time '$n$' towards $f^*$.

(iii) $\bar{f}(n) > f^*$.

For the analysis of this case, (71) is expressed as below:

\[ \bar{f}(n+1) = \bar{f}(n) - \Delta , \tag{73} \]

where $\Delta = \frac{p(n)\,[2\bar{f}(n) - 1]}{n+1} > 0$, because
$\bar{f}(n) > f^*$.

For this case, (73) is analogous to (36), where $\Delta > 0$, except
that $\bar{f}(n)$ replaces the term $f(n)$ of (36).

With these three cases, and following the rigorous analysis of Theorem
1 (omitted here for the sake of brevity), it can be seen that

\[ \lim_{n \to \infty} E[f(n)] = f^* . \tag{74} \]

Thus $f(n)$ is a random variable whose expected value, $\bar{f}(n)$,
converges to the desired probability of 0 in the output, $f^*$, as
$n \to \infty$.

To analyze the asymptotic variance of this random variable, $f(n)$, the
calculation is as follows:

\[ \mathrm{Var}[f(n+1)] = E[f^2(n+1)] - E[f(n+1)]^2 . \tag{75} \]

To evaluate $E[f^2(n+1)]$, the expression for the conditional
expectation $E[f^2(n+1)|_{f(n)}]$ from (68) is written as follows:

\begin{align}
E[f^2(n+1)|_{f(n)}] &= \left(\frac{n f(n) + 1}{n+1}\right)^2 \{[1 - f(n)]\,p(n) + f(n)\,[1 - p(n)]\} \notag \\
&\quad + \left(\frac{n f(n)}{n+1}\right)^2 \{[1 - f(n)]\,[1 - p(n)] + f(n)\,p(n)\} . \tag{76}
\end{align}

On simplification, it turns out that all the cubic terms involving
$f(n)$ cancel (fortunately). Taking expectations a second time yields
$E[f^2(n+1)]$ to be a linear function of $E[f^2(n)]$, $E[f(n)]$ and some
constants not involving $f(n)$. More specifically, if $\bar{f}(n)$
represents $E[f(n)]$ and $\overline{f^2}(n)$ represents $E[f^2(n)]$,

\[ \overline{f^2}(n+1) = \frac{1}{(n+1)^2} \left\{ \overline{f^2}(n)\,[n^2 - 2n + 4np] + [2p - 1 + 2n - 2np]\,\bar{f}(n) + (1-p) \right\} . \tag{77} \]

Normalizing this equation by multiplying both sides by $(n+1)^2$ yields

\[ (n+1)^2\,[\overline{f^2}(n+1)] = \left\{ \overline{f^2}(n)\,[n^2 - 2n + 4np] + [2p - 1 + 2n - 2np]\,\bar{f}(n) + (1-p) \right\} . \tag{78} \]

There are now two situations, namely the case when the sequence
converges and, alternatively, when it does not. In the latter case,
$\overline{f^2}(n)$ converges slower than $n^2$. Taking the limit as
$n \to \infty$, this leads to the trivial (non-interesting) result that
$\overline{f^2}(\infty) = \overline{f^2}(\infty)$. The more meaningful
result is obtained when $\overline{f^2}(n)$ converges faster than $n^2$.
In this case, expanding both sides and setting $\overline{f^2}(n)$ to
its terminal value $\overline{f^2}(\infty)$ (which is ultimately solved
for, and hence moved to the LHS), yields:

\begin{align}
& \lim_{n \to \infty} [n^2 + 2n + 1 - n^2 + 2n - 4np]\,\overline{f^2}(\infty) = \tag{79} \\
& \lim_{n \to \infty} [2p - 1 + 2n - 2np]\,\bar{f}(\infty) + (1-p) . \tag{80}
\end{align}

Observe that the $n^2$ terms again cancel, implying that a finite limit
for the solution exists. This leads to:

\[ \overline{f^2}(\infty) = \lim_{n \to \infty} \frac{2p - 1 + 2n - 2np}{4n + 1 - 4np}\,\bar{f}(\infty) , \tag{81} \]

or rather:

\[ \overline{f^2}(\infty) = \lim_{n \to \infty} \frac{2n(1-p)}{4n(1-p)}\,\bar{f}(\infty) = 0.25 . \tag{82} \]

Since $\bar{f}(\infty)$ is 0.5 and $\overline{f^2}(\infty)$ is 0.25, it
follows that $\mathrm{Var}[f]$ is zero. Thus $f(n)$ converges to a
random variable with mean $f^*$ and variance zero, implying both mean
square convergence and convergence in probability.

Theorem 7 (Convergence of Process RV J3 JE mj2 ) . Consider a 
memoryless source whose alphabet is S = {si, . . . , s m } and a code al- 
ls phabet, A = {0, 1}. If the source sequence X = sc[l], . . . , x[M], . . ., with 
x[i] 6 <S, i = 1, . . . , M, . . is encoded using the Process RV_A_H_E m> 2 so 
as to yield the output sequence y = y[l], . . . , y[R] : . . such that y[i] 6 -A, 
i = 1, . . . , R, . . then 



20 



lim E[/(n)] = /*, and (83) 
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Urn Var[/(n)] = 0, (84) 

where /* = 0.5 is the requested probability of 0 in the output, and 
f(n) = with c 0 (n) being the number of 0 V encoded up to time n. 
5 Thus /(n) converges to /* in the mean square sense and in probability. 

Proof. As per Process RV_A_H_E$_{m,2}$, a Huffman tree, $\mathcal{T}_0$, is initially
created based on the initial assumed distribution for the symbols of $\mathcal{S}$. As
in the proof of convergence of Process D_A_H_E$_{m,2}$ (Theorem 5), it is
again assumed that the maximum level of $\mathcal{T}_k$ is $l = j$. This level may not
be reached at all time instants. However, the first node is reached for
every symbol coming from $\mathcal{X}$, because every path starts from the root of
$\mathcal{T}_k$. The proof of Theorem 6 now completes the proof of the basis case.

For the inductive argument, the inductive hypothesis is used, i.e., that
the fixed point $f^*$ is achieved at every level (up to $j - 1$) of $\mathcal{T}_k$. By arguing
in a manner identical to the proof of Theorem 5 (omitted here to avoid
repetition), the result follows by induction on the number of levels in the
tree $\mathcal{T}_k$. □
Appendix C
A Simple Encryption Useful for DODE+

This Appendix contains a straightforward encryption that can be used
in DODE+ as a special case of ENC$_{Preserve}$.

Suppose that the input sequence $\mathcal{X}$ is encoded by using any
deterministic or randomized embodiment for DODE, generating an output
sequence $\mathcal{Y} = y[1] \ldots y[R]$, where $y[i] \in \mathcal{A} = \{0, 1\}$. $\mathcal{Y}$ is parsed using
$m = 2^k$ source symbols, $\mathcal{S} = \{s_0, \ldots, s_{m-1}\}$, whose binary representations
are obtained in a natural manner (i.e. from 0000... to 1111...). The
output, $\mathcal{Y}'$, is generated by encoding each of the $R$ div $k$ symbols, $s_j$, of
$\mathcal{Y}$ as the corresponding $w_j$ obtained from a particular encoding scheme,
$\varphi : \mathcal{S} = \{s_0, \ldots, s_{m-1}\} \to \mathcal{C} = \{w_0, \ldots, w_{m-1}\}$. The remaining $R$ mod $k$
bits can be encoded by using a similar mapping for a smaller number of
bits.
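
The parsing step can be sketched as follows. This is only an illustration under stated assumptions: the function name `substitute_blocks` and the 2-bit `toy_mapping` are hypothetical, and the $R$ mod $k$ leftover bits are simply returned unencoded rather than passed through the smaller mapping mentioned above.

```python
def substitute_blocks(bits, mapping, k):
    """Replace every complete k-bit block of `bits` using `mapping`.

    Returns (encoded, tail): `encoded` covers the R div k full blocks,
    and `tail` holds the R mod k leftover bits, left unencoded in this sketch.
    """
    full = len(bits) - len(bits) % k              # k * (R div k)
    blocks = [bits[i:i + k] for i in range(0, full, k)]
    encoded = "".join(mapping[b] for b in blocks)
    return encoded, bits[full:]


# Hypothetical mapping for k = 2 (m = 4), chosen only for illustration.
toy_mapping = {"00": "10", "01": "00", "10": "11", "11": "01"}

encoded, tail = substitute_blocks("0111001", toy_mapping, k=2)
print(encoded, tail)  # 000110 1
```

Here $R = 7$ and $k = 2$, so three full blocks are substituted and one bit is left over.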

Notice that there are $m!$ different encoding schemes, and that each
encoding scheme has a unique key, $\mathcal{K}$, which is an integer from 0 to
$m! - 1$. The Process to transform a given key into a mapping consists of
repeatedly dividing the key by a number, $Z$, which starts at $m$ and is
decremented to unity. The remainder at every stage specifies the current
mapping, and the quotient is then divided by the decremented value of
$Z$.

More formally, let $w_{j0} w_{j1} \ldots w_{jk}$ refer to $w_j \in \mathcal{C}$.

The mapping procedure proceeds as follows. It starts with $\mathcal{S} =
\{s_0, \ldots, s_{m-1}\}$. $\mathcal{K}$ is divided by $m$, obtaining the quotient, $d_0$, and the
remainder, $j_0$. The code word of $\mathcal{C}$, $w_0$, is $s_{j_0}$, which is removed from $\mathcal{S}$.
Then, $d_0$ is divided by $m - 1$, obtaining $d_1$ and $j_1$, and it continues in the
same fashion until all the symbols of $\mathcal{S}$ have been removed.

The process is straightforward and explained in standard textbooks, 
and omitted in the interest of brevity. However, a small example will 
help to clarify this procedure. 

Example 7. Let $\mathcal{S} = \{000, 001, 010, 011, 100, 101, 110, 111\}$ be the source
alphabet for the case when the output obtained from DODE is $\mathcal{Y} =$
011101001010110010110, and the key is $\mathcal{K} = 2451$.

The mapping construction procedure is depicted in Table 28. Dividing
$\mathcal{K}$ by 8, we obtain $d_0 = 306$ and $j_0 = 3$, and hence $s_3 = 011$ is assigned
to $w_0$; $s_3$ is removed from $\mathcal{S}$ (in bold typeface in the table). Dividing 306
by 7, we obtain $d_1 = 43$ and $j_1 = 5$, and hence $s_5 = 110$, the symbol at
index 5 among the seven that remain, is assigned to $w_1$; $s_5$ is removed
from $\mathcal{S}$ (in bold typeface in the table).

The mapping construction procedure continues in the same fashion,
obtaining the mapping shown in Table 29 below.

Finally, we parse $\mathcal{Y}$ using this mapping, and hence we obtain $\mathcal{Y}' =$
100000110001101001101. □
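
The key-to-mapping construction and the figures of Example 7 can be checked with a short sketch. The function name `key_to_mapping` is hypothetical, and the sketch assumes, as the example suggests, that the symbols remaining in $\mathcal{S}$ are re-indexed from 0 after each removal.

```python
def key_to_mapping(key, k):
    """Turn an integer key in [0, m! - 1], m = 2**k, into a mapping S -> C.

    The key is divided by Z = m, m-1, ..., 1. Each remainder selects the
    symbol (by position among those still remaining) that becomes the next
    code word; the quotient is carried on to the next division.
    """
    m = 2 ** k
    sources = [format(i, "0{}b".format(k)) for i in range(m)]  # s_0 ... s_{m-1}
    remaining = list(sources)
    codewords = []
    for z in range(m, 0, -1):
        key, j = divmod(key, z)             # quotient d, remainder j
        codewords.append(remaining.pop(j))  # w = s_j, removed from S
    return dict(zip(sources, codewords))


mapping = key_to_mapping(2451, 3)
print(list(mapping.values()))
# ['011', '110', '001', '100', '010', '000', '101', '111']

y = "011101001010110010110"
y_prime = "".join(mapping[y[i:i + 3]] for i in range(0, len(y), 3))
print(y_prime)  # 100000110001101001101
```

The first two divisions give (306, 3) and (43, 5), and the encoded string agrees with the value stated at the end of Example 7.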

It is easy to see that if the symbols of $\mathcal{S}$ are equally likely (as is the
case, since the outputs of DODE and RDODE guarantee Statistical Perfect
Secrecy), the key specification capabilities presented above make DODE+ a truly
Table 28: An example of a straightforward mapping construction procedure
that can be used by DODE+. In this case, the key, $\mathcal{K}$, is 2451, and
the encryption is achieved by transforming sequences of three symbols at
a time.

Table 29: An example of one of the 8! possible encoding schemes from $\mathcal{S}$
to $\mathcal{C}$. In this case, the key, $\mathcal{K}$, is 2451, and the encryption is achieved by
transforming sequences of three symbols at a time.

    S      binary          C      binary

    s_0    000      -->    w_0    011
    s_1    001      -->    w_1    110
    s_2    010      -->    w_2    001
    s_3    011      -->    w_3    100
    s_4    100      -->    w_4    010
    s_5    101      -->    w_5    000
    s_6    110      -->    w_6    101
    s_7    111      -->    w_7    111
functional cryptosystem that also guarantees Statistical Perfect Secrecy.
Thus, the cryptanalysis of DODE+ using statistical means is impossible.
Breaking DODE+ necessarily requires the exhaustive search of the entire
key space.
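
As a rough illustration of the size of the key space for this simple encryption, the number of keys is $m!$ with $m = 2^k$, so it is fixed entirely by the block length $k$. The sketch below, whose helper name `key_space_bits` is hypothetical, reports that size in bits for a few values of $k$; the values of $k$ other than 3 are chosen here only for comparison and do not come from the text.

```python
import math


def key_space_bits(k):
    """Return log2(m!) for m = 2**k, i.e. the key length in bits.

    ln(m!) equals lgamma(m + 1), which avoids building a huge integer.
    """
    m = 2 ** k
    return math.lgamma(m + 1) / math.log(2)


print(math.factorial(2 ** 3))  # 40320 keys for k = 3, as in Example 7

for k in (3, 4, 8):
    print(k, round(key_space_bits(k), 1))
```

For $k = 3$ this gives 40320 keys, or about 15.3 bits.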